docs migration from openstack-manuals
Context for this is at
https://specs.openstack.org/openstack/docs-specs/specs/pike/os-manuals-migration.html

Change-Id: I9a4da27ce1d56b6406e2db979698038488f3cf6f

BIN  doc/source/admin/figures/objectstorage-accountscontainers.png (new file, 32 KiB)
BIN  doc/source/admin/figures/objectstorage-arch.png (new file, 56 KiB)
BIN  doc/source/admin/figures/objectstorage-buildingblocks.png (new file, 48 KiB)
BIN  doc/source/admin/figures/objectstorage-nodes.png (new file, 58 KiB)
BIN  doc/source/admin/figures/objectstorage-partitions.png (new file, 28 KiB)
BIN  doc/source/admin/figures/objectstorage-replication.png (new file, 45 KiB)
BIN  doc/source/admin/figures/objectstorage-ring.png (new file, 23 KiB)
BIN  doc/source/admin/figures/objectstorage-usecase.png (new file, 61 KiB)
BIN  doc/source/admin/figures/objectstorage-zones.png (new file, 10 KiB)
BIN  doc/source/admin/figures/objectstorage.png (new file, 23 KiB)

doc/source/admin/index.rst (new file, +22 lines)

===================================
OpenStack Swift Administrator Guide
===================================

.. toctree::
   :maxdepth: 2

   objectstorage-intro.rst
   objectstorage-features.rst
   objectstorage-characteristics.rst
   objectstorage-components.rst
   objectstorage-ringbuilder.rst
   objectstorage-arch.rst
   objectstorage-replication.rst
   objectstorage-large-objects.rst
   objectstorage-auditors.rst
   objectstorage-EC.rst
   objectstorage-account-reaper.rst
   objectstorage-tenant-specific-image-storage.rst
   objectstorage-monitoring.rst
   objectstorage-admin.rst
   objectstorage-troubleshoot.rst

doc/source/admin/objectstorage-EC.rst (new file, +31 lines)

==============
Erasure coding
==============

Erasure coding is a set of algorithms that allows the reconstruction of
missing data from a set of original data. In theory, erasure coding uses
less capacity than replicas while providing similar durability
characteristics. From an application perspective, erasure coding support
is transparent. Object Storage (swift) implements erasure coding as a
Storage Policy. See `Storage Policies
<https://docs.openstack.org/developer/swift/overview_policies.html>`_
for more details.

There is no external API related to erasure coding. Create a container using a
Storage Policy; the interaction with the cluster is the same as with any
other durability policy. Because the support is implemented as a Storage
Policy, you can isolate all storage devices associated with your cluster's
erasure coding capability. It is entirely possible to share devices between
storage policies, but for erasure coding it may make more sense to use
not only separate devices but possibly even entire nodes dedicated for erasure
coding.
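
As a rough sketch only, an erasure-coded policy is declared in
``swift.conf`` along the following lines. The policy index, name, and EC
parameters shown here are placeholders; a matching ``object-2`` ring must
also be built for the devices that back the policy:

.. code-block:: ini

   [storage-policy:2]
   name = ec-example
   policy_type = erasure_coding
   # EC scheme and fragment layout are deployment choices.
   ec_type = liberasurecode_rs_vand
   ec_num_data_fragments = 10
   ec_num_parity_fragments = 4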

.. important::

   The erasure code support in Object Storage is considered beta in Kilo.
   Most major functionality is included, but it has not been tested or
   validated at large scale. This feature relies on ``ssync`` for durability.
   We recommend deployers do extensive testing and not deploy production
   data using an erasure code storage policy.
   If any bugs are found during testing, please report them to
   https://bugs.launchpad.net/swift

doc/source/admin/objectstorage-account-reaper.rst (new file, +51 lines)

==============
Account reaper
==============

The purpose of the account reaper is to remove data from deleted accounts.

A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the ``account_stat`` table in the account database and its replicas to
``DELETED``, marking the account's data for deletion.

Typically, a specific retention time or undelete facility is not provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file to
delay the actual deletion of data. At this time, to undelete you have to update
the account database replicas directly, set the ``status`` column to an
empty string, and update the ``put_timestamp`` to be greater than the
``delete_timestamp``.
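
For illustration only, the manual undelete described above amounts to
something like the following on each database replica. The database path
and the new timestamp are placeholders; the timestamp must be a valid
Swift timestamp newer than the ``delete_timestamp``:

.. code-block:: console

   $ sqlite3 /path/to/account.db \
       "UPDATE account_stat SET status = '', put_timestamp = '<timestamp>'"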

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and scans the server
occasionally for account databases marked for deletion. It only fires up
on the accounts for which the server is the primary node, so that
multiple account servers aren't trying to do it simultaneously. Using
multiple servers to delete one account might improve the deletion speed
but requires coordination to avoid duplication. Speed really is not a
big concern with data deletion, and large accounts aren't deleted often.

Deleting an account is simple. For each account container, all objects
are deleted and then the container is deleted. Deletion requests that
fail do not stop the overall process but cause the overall process to
fail eventually (for example, if an object delete times out, you will
not be able to delete the container or the account). The account reaper
keeps trying to delete an account until it is empty, at which point the
database reclaim process within the ``db_replicator`` will remove the
database files.

A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for example:

.. code-block:: console

   Account <name> has not been reaped since <date>

You can control when this is logged with the ``reap_warn_after`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file.
The default value is 30 days.
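
Putting the two options together, a minimal sketch of the relevant section
of ``account-server.conf`` might look like the following. The values are
illustrative and are given in seconds:

.. code-block:: ini

   [account-reaper]
   # Delay the actual deletion of data after an account is marked DELETED.
   delay_reaping = 604800
   # Log a warning if an account has not been reaped after this long.
   reap_warn_after = 2592000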

doc/source/admin/objectstorage-admin.rst (new file, +11 lines)

========================================
System administration for Object Storage
========================================

By understanding Object Storage concepts, you can better monitor and
administer your storage solution. The majority of the administration
information is maintained in developer documentation at
`docs.openstack.org/developer/swift/ <https://docs.openstack.org/developer/swift/>`__.

See the `OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/object-storage.html>`__
for a list of configuration options for Object Storage.

doc/source/admin/objectstorage-arch.rst (new file, +88 lines)

====================
Cluster architecture
====================

Access tier
~~~~~~~~~~~

Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

.. note::

   If you want to use OpenStack Identity API v3 for authentication, you
   have the following options available in ``/etc/swift/dispersion.conf``:
   ``auth_version``, ``user_domain_name``, ``project_domain_name``,
   and ``project_name``.
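
As a rough sketch of such a configuration (the endpoint, credentials, and
domain values below are placeholders), ``/etc/swift/dispersion.conf`` with
Identity v3 could look like:

.. code-block:: ini

   [dispersion]
   auth_url = http://controller:5000/v3
   auth_user = dispersion
   auth_key = secret
   auth_version = 3
   project_name = service
   user_domain_name = Default
   project_domain_name = Default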

**Object Storage architecture**

.. figure:: figures/objectstorage-arch.png

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.

Since this is an HTTP addressable storage service, you may incorporate a
load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces: one for the
incoming ``front-end`` requests and the other for the ``back-end`` access to
the object storage nodes to put and fetch data.

Factors to consider
-------------------

For most publicly facing deployments as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.

Storage nodes
~~~~~~~~~~~~~

In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision object stores with single-gigabit or
10-gigabit network interfaces depending on the expected workload and
desired performance.

**Object Storage (swift)**

.. figure:: figures/objectstorage-nodes.png

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.

Factors to consider
-------------------

You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.

To achieve apparent higher throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.

doc/source/admin/objectstorage-auditors.rst (new file, +30 lines)

==============
Object Auditor
==============

On system failures, the XFS file system can sometimes truncate files it is
trying to write and produce zero-byte files. The object-auditor will catch
these problems, but in the case of a system crash it is advisable to run
an extra, less rate-limited sweep to check for these specific files.
You can run this command as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

.. note::

   ``-z`` means to only check for zero-byte files, at 1000 files per second.

It is useful to run the object auditor on a specific device or set of devices.
You can run the object-auditor once as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once \
     --devices=sda,sdb

.. note::

   This will run the object auditor on only the ``sda`` and ``sdb`` devices.
   This parameter accepts a comma-separated list of values.

doc/source/admin/objectstorage-characteristics.rst (new file, +43 lines)

==============================
Object Storage characteristics
==============================

The key characteristics of Object Storage are:

- All objects stored in Object Storage have a URL.

- All objects stored are replicated 3✕ in as-unique-as-possible zones,
  which can be defined as a group of drives, a node, a rack, and so on.

- All objects have their own metadata.

- Developers interact with the object storage system through a RESTful
  HTTP API.

- Object data can be located anywhere in the cluster.

- The cluster scales by adding additional nodes without sacrificing
  performance, which allows a more cost-effective linear storage
  expansion than fork-lift upgrades.

- Data does not have to be migrated to an entirely new storage system.

- New nodes can be added to the cluster without downtime.

- Failed nodes and disks can be swapped out without downtime.

- It runs on industry-standard hardware, such as Dell, HP, and
  Supermicro.

.. _objectstorage-figure:

Object Storage (swift)

.. figure:: figures/objectstorage.png

Developers can either write directly to the Swift API or use one of the
many client libraries that exist for all of the popular programming
languages, such as Java, Python, Ruby, and C#. Amazon S3 and RackSpace
Cloud Files users should be very familiar with Object Storage. Users new
to object storage systems will have to adjust to a different approach
and mindset than those required for a traditional filesystem.
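
For example, with the ``python-swiftclient`` library a small
upload-and-download round trip looks roughly like the following sketch.
The authentication endpoint, credentials, and container name are
placeholders:

.. code-block:: python

   from swiftclient import client

   # Authenticate against Identity (v3) and get a connection to the cluster.
   conn = client.Connection(
       authurl='http://controller:5000/v3',
       user='demo',
       key='secret',
       auth_version='3',
       os_options={'project_name': 'demo',
                   'user_domain_name': 'Default',
                   'project_domain_name': 'Default'})

   # Every object gets a URL of the form <account>/<container>/<object>.
   conn.put_container('backups')
   conn.put_object('backups', 'notes.txt', contents=b'hello swift')

   # Read the object back; headers carry its metadata.
   headers, body = conn.get_object('backups', 'notes.txt')
   print(body)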

doc/source/admin/objectstorage-components.rst (new file, +258 lines)

==========
Components
==========

Object Storage uses the following components to deliver high
availability, high durability, and high concurrency:

- **Proxy servers** - Handle all of the incoming API requests.

- **Rings** - Map logical names of data to locations on particular
  disks.

- **Zones** - Isolate data from other zones. A failure in one zone
  does not impact the rest of the cluster as data replicates
  across zones.

- **Accounts and containers** - Each account and container are
  individual databases that are distributed across the cluster. An
  account database contains the list of containers in that account. A
  container database contains the list of objects in that container.

- **Objects** - The data itself.

- **Partitions** - A partition stores objects, account databases, and
  container databases and helps manage locations where data lives in
  the cluster.

.. _objectstorage-building-blocks-figure:

**Object Storage building blocks**

.. figure:: figures/objectstorage-buildingblocks.png

Proxy servers
-------------

Proxy servers are the public face of Object Storage and handle all of
the incoming API requests. Once a proxy server receives a request, it
determines the storage node based on the object's URL, for example:
https://swift.example.com/v1/account/container/object. Proxy servers
also coordinate responses, handle failures, and coordinate timestamps.

Proxy servers use a shared-nothing architecture and can be scaled as
needed based on projected workloads. A minimum of two proxy servers
should be deployed for redundancy. If one proxy server fails, the others
take over.

For more information concerning proxy server configuration, see
`Configuration Reference
<https://docs.openstack.org/ocata/config-reference/object-storage/proxy-server.html>`_.

Rings
-----

A ring represents a mapping between the names of entities stored on disks
and their physical locations. There are separate rings for accounts,
containers, and objects. When other components need to perform any
operation on an object, container, or account, they need to interact
with the appropriate ring to determine their location in the cluster.

The ring maintains this mapping using zones, devices, partitions, and
replicas. Each partition in the ring is replicated, by default, three
times across the cluster, and partition locations are stored in the
mapping maintained by the ring. The ring is also responsible for
determining which devices are used for handoff in failure scenarios.

Data can be isolated into zones in the ring. Each partition replica is
guaranteed to reside in a different zone. A zone could represent a
drive, a server, a cabinet, a switch, or even a data center.

The partitions of the ring are equally divided among all of the devices
in the Object Storage installation. When partitions need to be moved
around (for example, if a device is added to the cluster), the ring
ensures that a minimum number of partitions are moved at a time, and
only one replica of a partition is moved at a time.

You can use weights to balance the distribution of partitions on drives
across the cluster. This can be useful, for example, when differently
sized drives are used in a cluster.

The ring is used by the proxy server and several background processes
(like replication).

.. _objectstorage-ring-figure:

**The ring**

.. figure:: figures/objectstorage-ring.png

These rings are externally managed. The server processes themselves
do not modify the rings; they are instead given new rings modified by
other tools.

The ring uses a configurable number of bits from an ``MD5`` hash for a path
as a partition index that designates a device. The number of bits kept
from the hash is known as the partition power, and 2 to the partition
power indicates the partition count. Partitioning the full ``MD5`` hash ring
allows other parts of the cluster to work in batches of items at once,
which ends up either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.

Another configurable value is the replica count, which indicates how
many of the partition-device assignments make up a single ring. For a
given partition number, each replica's device will not be in the same
zone as any other replica's device. Zones can be used to group devices
based on physical locations, power separations, network separations, or
any other attribute that would improve the availability of multiple
replicas at the same time.

Zones
-----

Object Storage allows configuring zones in order to isolate failure
boundaries. If possible, each data replica resides in a separate zone.
At the smallest level, a zone could be a single drive or a grouping of a
few drives. If there were five object storage servers, then each server
would represent its own zone. Larger deployments would have an entire
rack (or multiple racks) of object servers, each representing a zone.
The goal of zones is to allow the cluster to tolerate significant
outages of storage servers without losing all replicas of the data.

.. _objectstorage-zones-figure:

**Zones**

.. figure:: figures/objectstorage-zones.png

Accounts and containers
-----------------------

Each account and container is an individual SQLite database that is
distributed across the cluster. An account database contains the list of
containers in that account. A container database contains the list of
objects in that container.

.. _objectstorage-accountscontainers-figure:

**Accounts and containers**

.. figure:: figures/objectstorage-accountscontainers.png

To keep track of object data locations, each account in the system has a
database that references all of its containers, and each container
database references each object.

Partitions
----------

A partition is a collection of stored data. This includes account databases,
container databases, and objects. Partitions are core to the replication
system.

Think of a partition as a bin moving throughout a fulfillment center
warehouse. Individual orders get thrown into the bin. The system treats
that bin as a cohesive entity as it moves throughout the system. A bin
is easier to deal with than many little things. It makes for fewer
moving parts throughout the system.

System replicators and object uploads/downloads operate on partitions.
As the system scales up, its behavior continues to be predictable
because the number of partitions is a fixed number.

Implementing a partition is conceptually simple: a partition is just a
directory sitting on a disk with a corresponding hash table of what it
contains.

.. _objectstorage-partitions-figure:

**Partitions**

.. figure:: figures/objectstorage-partitions.png

Replicators
-----------

In order to ensure that there are three copies of the data everywhere,
replicators continuously examine each partition. For each local
partition, the replicator compares it against the replicated copies in
the other zones to see if there are any differences.

The replicator knows if replication needs to take place by examining
hashes. A hash file is created for each partition, which contains hashes
of each directory in the partition. For a given partition, the hash
files for each of the partition's copies are compared. If the hashes are
different, then it is time to replicate, and the directory that needs to
be replicated is copied over.

This is where partitions come in handy. With fewer things in the system,
larger chunks of data are transferred around (rather than lots of little
TCP connections, which is inefficient) and there is a consistent number
of hashes to compare.

The cluster eventually has consistent behavior where the newest data
has priority.

.. _objectstorage-replication-figure:

**Replication**

.. figure:: figures/objectstorage-replication.png

If a zone goes down, one of the nodes containing a replica notices and
proactively copies data to a handoff location.

Use cases
---------

The following sections show use cases for object uploads and downloads
and introduce the components.

Upload
~~~~~~

A client uses the REST API to make an HTTP request to PUT an object into
an existing container. The cluster receives the request. First, the
system must figure out where the data is going to go. To do this, the
account name, container name, and object name are all used to determine
the partition where this object should live.

Then a lookup in the ring figures out which storage nodes contain the
partitions in question.

The data is then sent to each storage node where it is placed in the
appropriate partition. At least two of the three writes must be
successful before the client is notified that the upload was successful.

Next, the container database is updated asynchronously to reflect that
there is a new object in it.
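
As a concrete illustration, using the example URL above, an upload is a
single authenticated ``PUT``. The file name and token are placeholders:

.. code-block:: console

   $ curl -i -X PUT -T photo.jpg \
       -H "X-Auth-Token: $TOKEN" \
       https://swift.example.com/v1/account/container/photo.jpg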

.. _objectstorage-usecase-figure:

**Object Storage in use**

.. figure:: figures/objectstorage-usecase.png

Download
~~~~~~~~

A request comes in for an account/container/object. Using the same
consistent hashing, the partition name is generated. A lookup in the
ring reveals which storage nodes contain that partition. A request is
made to one of the storage nodes to fetch the object and, if that fails,
requests are made to the other nodes.

doc/source/admin/objectstorage-features.rst (new file, +63 lines)

=====================
Features and benefits
=====================

.. list-table::
   :header-rows: 1
   :widths: 10 40

   * - Features
     - Benefits
   * - Leverages commodity hardware
     - No lock-in, lower price/GB.
   * - HDD/node failure agnostic
     - Self-healing, reliable, data redundancy protects from failures.
   * - Unlimited storage
     - Large and flat namespace, highly scalable read/write access,
       able to serve content directly from storage system.
   * - Multi-dimensional scalability
     - Scale-out architecture: Scale vertically and
       horizontally-distributed storage. Backs up and archives large
       amounts of data with linear performance.
   * - Account/container/object structure
     - No nesting, not a traditional file system: Optimized for scale,
       it scales to multiple petabytes and billions of objects.
   * - Built-in replication 3✕ + data redundancy (compared with 2✕ on
       RAID)
     - A configurable number of accounts, containers and object copies
       for high availability.
   * - Easily add capacity (unlike RAID resize)
     - Elastic data scaling with ease.
   * - No central database
     - Higher performance, no bottlenecks.
   * - RAID not required
     - Handle many small, random reads and writes efficiently.
   * - Built-in management utilities
     - Account management: Create, add, verify, and delete users;
       Container management: Upload, download, and verify; Monitoring:
       Capacity, host, network, log trawling, and cluster health.
   * - Drive auditing
     - Detect drive failures preempting data corruption.
   * - Expiring objects
     - Users can set an expiration time or a TTL on an object to
       control access.
   * - Direct object access
     - Enable direct browser access to content, such as for a control
       panel.
   * - Realtime visibility into client requests
     - Know what users are requesting.
   * - Supports S3 API
     - Utilize tools that were designed for the popular S3 API.
   * - Restrict containers per account
     - Limit access to control usage by user.
   * - Support for NetApp, Nexenta, Solidfire
     - Unified support for block volumes using a variety of storage
       systems.
   * - Snapshot and backup API for block volumes.
     - Data protection and recovery for VM data.
   * - Standalone volume API available
     - Separate endpoint and API for integration with other compute
       systems.
   * - Integration with Compute
     - Fully integrated with Compute for attaching block volumes and
       reporting on usage.

doc/source/admin/objectstorage-intro.rst (new file, +23 lines)

==============================
Introduction to Object Storage
==============================

OpenStack Object Storage (swift) is used for redundant, scalable data
storage using clusters of standardized servers to store petabytes of
accessible data. It is a long-term storage system for large amounts of
static data which can be retrieved and updated. Object Storage uses a
distributed architecture with no central point of control, providing
greater scalability, redundancy, and permanence. Objects are written to
multiple hardware devices, with the OpenStack software responsible for
ensuring data replication and integrity across the cluster. Storage
clusters scale horizontally by adding new nodes. Should a node fail,
OpenStack works to replicate its content from other active nodes.
Because OpenStack uses software logic to ensure data replication and
distribution across different devices, inexpensive commodity hard
drives and servers can be used in lieu of more expensive equipment.

Object Storage is ideal for cost effective, scale-out storage. It
provides a fully distributed, API-accessible storage platform that can
be integrated directly into applications or used for backup, archiving,
and data retention.

doc/source/admin/objectstorage-large-objects.rst (new file, +35 lines)

====================
Large object support
====================

Object Storage (swift) uses segmentation to support the upload of large
objects. By default, Object Storage limits the size of a single object
to 5 GB. With segmentation, the size of a single uploaded object is
virtually unlimited. The segmentation process works by splitting the
object into segments and automatically creating a manifest file that
ties the segments together as a single object. This option offers
greater upload speed with the possibility of parallel uploads.

Large objects
~~~~~~~~~~~~~

A large object is comprised of two types of objects:

- **Segment objects** store the object content. You can divide your
  content into segments, and upload each segment into its own segment
  object. Segment objects do not have any special features. You create,
  update, download, and delete segment objects just as you would normal
  objects.

- A **manifest object** links the segment objects into one logical
  large object. When you download a manifest object, Object Storage
  concatenates and returns the contents of the segment objects in the
  response body of the request. The manifest object types are:

  - **Static large objects**
  - **Dynamic large objects**

To find out more information on large object support, see `Large objects
<https://docs.openstack.org/user-guide/cli-swift-large-object-creation.html>`_
in the OpenStack End User Guide, or `Large Object Support
<https://docs.openstack.org/developer/swift/overview_large_objects.html>`_
in the developer documentation.
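
For instance, the ``swift`` command-line client can segment an upload for
you. A rough sketch, where the container and file names are placeholders
and the segment size is given in bytes, is:

.. code-block:: console

   $ # Upload a large file in 1 GB segments; a manifest object is created
   $ # automatically so the file can later be fetched as one object.
   $ swift upload --segment-size 1073741824 backups large-archive.tar

   $ # Download through the manifest as a single object.
   $ swift download backups large-archive.tar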

doc/source/admin/objectstorage-monitoring.rst (new file, +228 lines)

=========================
Object Storage monitoring
=========================

.. note::

   This section was excerpted from a blog post by `Darrell
   Bishop <http://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd>`_ and
   has since been edited.

An OpenStack Object Storage cluster is a collection of many daemons that
work together across many nodes. With so many different components, you
must be able to tell what is going on inside the cluster. Tracking
server-level meters like CPU utilization, load, memory consumption, disk
usage and utilization, and so on is necessary, but not sufficient.

Swift Recon
~~~~~~~~~~~

The Swift Recon middleware (see
`Cluster Telemetry and Monitoring <https://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring>`_)
provides general machine statistics, such as load average, socket
statistics, ``/proc/meminfo`` contents, as well as Swift-specific meters:

- The ``MD5`` sum of each ring file.

- The most recent object replication time.

- Count of each type of quarantined file: Account, container, or
  object.

- Count of ``async_pendings`` (deferred container updates) on disk.

Swift Recon is middleware that is installed in the object server's
pipeline and takes one required option: a local cache directory. To
track ``async_pendings``, you must set up an additional cron job for
each object server. You access data by either sending HTTP requests
directly to the object server or using the ``swift-recon`` command-line
client.
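
For example, a few typical ``swift-recon`` invocations look like the
following sketch; exact flag availability varies between releases:

.. code-block:: console

   $ # Ring MD5 checksums across the cluster
   $ swift-recon --md5

   $ # Object replication times and async_pending counts
   $ swift-recon object -r
   $ swift-recon object -a

   $ # Quarantine counts for all server types
   $ swift-recon -q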

There are Object Storage cluster statistics but the typical
server meters overlap with existing server monitoring systems. To get
the Swift-specific meters into a monitoring system, they must be polled.
Swift Recon acts as a middleware meters collector. The
process that feeds meters to your statistics system, such as
``collectd`` and ``gmond``, should already run on the storage node.
You can choose to either talk to Swift Recon or collect the meters
directly.

Swift-Informant
~~~~~~~~~~~~~~~

Swift-Informant middleware (see
`swift-informant <https://github.com/pandemicsyn/swift-informant>`_) has
real-time visibility into Object Storage client requests. It sits in the
pipeline for the proxy server, and after each request to the proxy server it
sends three meters to a ``StatsD`` server:

- A counter increment for a meter like ``obj.GET.200`` or
  ``cont.PUT.404``.

- Timing data for a meter like ``acct.GET.200`` or ``obj.GET.200``.
  [The README says the meters look like ``duration.acct.GET.200``, but
  I do not see the ``duration`` in the code. I am not sure what the
  Etsy server does but our StatsD server turns timing meters into five
  derivative meters with new segments appended, so it probably works as
  coded. The first meter turns into ``acct.GET.200.lower``,
  ``acct.GET.200.upper``, ``acct.GET.200.mean``,
  ``acct.GET.200.upper_90``, and ``acct.GET.200.count``].

- A counter increase by the bytes transferred for a meter like
  ``tfer.obj.PUT.201``.

This is used for receiving information on the quality of service clients
experience with the timing meters, as well as sensing the volume of the
various modifications of a request by server type, command, and response
code. Swift-Informant requires no change to core Object
Storage code because it is implemented as middleware. However, it gives
no insight into the workings of the cluster past the proxy server.
If the responsiveness of one storage node degrades, you can only see
that some of the requests are bad, either as high latency or error
status codes.

Statsdlog
~~~~~~~~~

The `Statsdlog <https://github.com/pandemicsyn/statsdlog>`_
project increments StatsD counters based on logged events. Like
Swift-Informant, it is also non-intrusive; however, statsdlog can track
events from all Object Storage daemons, not just proxy-server. The
daemon listens to a UDP stream of syslog messages, and StatsD counters
are incremented when a log line matches a regular expression. Meter
names are mapped to regex match patterns in a JSON file, allowing
flexible configuration of what meters are extracted from the log stream.

Currently, only the first matching regex triggers a StatsD counter
increment, and the counter is always incremented by one. There is no way
to increment a counter by more than one or send timing data to StatsD
based on the log line content. The tool could be extended to handle more
meters for each line and data extraction, including timing data. But a
coupling would still exist between the log textual format and the log
parsing regexes, which would themselves be more complex to support
multiple matches for each line and data extraction. Also, log processing
introduces a delay between the triggering event and sending the data to
StatsD. It would be preferable to increment error counters where they
occur and send timing data as soon as it is known to avoid coupling
between a log string and a parsing regex and prevent a time delay
between events and sending data to StatsD.

The next section describes another method for gathering Object Storage
operational meters.

Swift StatsD logging
~~~~~~~~~~~~~~~~~~~~

StatsD (see `Measure Anything, Measure Everything
<http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/>`_)
was designed for application code to be deeply instrumented. Meters are
sent in real-time by the code that just noticed or did something. The
overhead of sending a meter is extremely low: a ``sendto`` of one UDP
packet. If that overhead is still too high, the StatsD client library
can send only a random portion of samples and StatsD approximates the
actual number when flushing meters upstream.

To avoid the problems inherent with middleware-based monitoring and
after-the-fact log processing, the sending of StatsD meters is
integrated into Object Storage itself. The submitted change set (see
`<https://review.openstack.org/#change,6058>`_) currently reports 124 meters
across 15 Object Storage daemons and the tempauth middleware. Details of
the meters tracked are in the `Administrator's
Guide <https://docs.openstack.org/developer/swift/admin_guide.html>`_.

The sending of meters is integrated with the logging framework. To
enable, configure ``log_statsd_host`` in the relevant config file. You
can also specify the port and a default sample rate. The specified
default sample rate is used unless a specific call to a statsd logging
method (see the list below) overrides it. Currently, no logging calls
override the sample rate, but it is conceivable that some meters may
require accuracy (``sample_rate=1``) while others may not.

.. code-block:: ini

   [DEFAULT]
   # ...
   log_statsd_host = 127.0.0.1
   log_statsd_port = 8125
   log_statsd_default_sample_rate = 1

Then the LogAdapter object returned by ``get_logger()``, usually stored
in ``self.logger``, has these new methods:

- ``set_statsd_prefix(self, prefix)`` Sets the client library stat
  prefix value which gets prefixed to every meter. The default prefix
  is the ``name`` of the logger such as ``object-server``,
  ``container-auditor``, and so on. This is currently used to turn
  ``proxy-server`` into one of ``proxy-server.Account``,
  ``proxy-server.Container``, or ``proxy-server.Object`` as soon as the
  Controller object is determined and instantiated for the request.

- ``update_stats(self, metric, amount, sample_rate=1)`` Increments
  the supplied meter by the given amount. This is used when you need
  to add or subtract more than one from a counter, like incrementing
  ``suffix.hashes`` by the number of computed hashes in the object
  replicator.

- ``increment(self, metric, sample_rate=1)`` Increments the given counter
  meter by one.

- ``decrement(self, metric, sample_rate=1)`` Lowers the given counter
  meter by one.

- ``timing(self, metric, timing_ms, sample_rate=1)`` Record that the
  given meter took the supplied number of milliseconds.

- ``timing_since(self, metric, orig_time, sample_rate=1)``
  Convenience method to record a timing meter whose value is "now"
  minus an existing timestamp.

.. note::

   These logging methods may safely be called anywhere you have a
   logger object. If StatsD logging has not been configured, the methods
   are no-ops. This avoids messy conditional logic each place a meter is
   recorded. These example usages show the new logging methods:

.. code-block:: python

   # swift/obj/replicator.py
   def update(self, job):
       # ...
       begin = time.time()
       try:
           hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'],
                   do_listdir=(self.replication_count % 10) == 0,
                   reclaim_age=self.reclaim_age)
           # See tpooled_get_hashes "Hack".
           if isinstance(hashed, BaseException):
               raise hashed
           self.suffix_hash += hashed
           self.logger.update_stats('suffix.hashes', hashed)
           # ...
       finally:
           self.partition_times.append(time.time() - begin)
           self.logger.timing_since('partition.update.timing', begin)

.. code-block:: python

   # swift/container/updater.py
   def process_container(self, dbfile):
       # ...
       start_time = time.time()
       # ... (the enclosing "if" block is elided in this excerpt)
           for event in events:
               if 200 <= event.wait() < 300:
                   successes += 1
               else:
                   failures += 1
           if successes > failures:
               self.logger.increment('successes')
               # ...
           else:
               self.logger.increment('failures')
               # ...
           # Only track timing data for attempted updates:
           self.logger.timing_since('timing', start_time)
       else:
           self.logger.increment('no_changes')
           self.no_changes += 1

doc/source/admin/objectstorage-replication.rst (new file, +98 lines)

===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.

Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the ``consistency
window``. This window defines the duration of the replication and how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.

If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

.. note::

   The replicator does not maintain desired levels of replication when
   failures such as entire node failures occur; most failures are
   transient.

The main replication types are:

- Database replication

  Replicates containers and objects.

- Object replication

  Replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~

Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.

If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~

The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.

The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
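
As a purely illustrative sketch (the real logic lives in the object
replicator and handles many more cases), the per-partition hash
comparison amounts to:

.. code-block:: python

   def suffixes_to_sync(local_hashes, remote_hashes):
       """Return the suffix directories whose content hashes differ."""
       return [suffix for suffix, digest in local_hashes.items()
               if remote_hashes.get(suffix) != digest]

   # Example: only suffix 'a3f' would be rsynced to this remote node.
   local = {'a3f': 'd41d8cd9', '07b': '9e107d9d'}
   remote = {'a3f': 'ffffffff', '07b': '9e107d9d'}
   print(suffixes_to_sync(local, remote))  # ['a3f']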

The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.

doc/source/admin/objectstorage-ringbuilder.rst (new file, +228 lines)

============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.

Ring data structure
~~~~~~~~~~~~~~~~~~~

The ring data structure consists of three top level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition to device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like:

.. code-block:: python

   devices = [self.devs[part2dev_id[partition]] for
              part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

``array('H')`` is used for memory conservation as there may be millions
of partitions.

Overload
~~~~~~~~

The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it cannot do both, the overload
factor determines what happens. Each device takes an extra
fraction of its desired partitions to allow for replica dispersion;
after that extra fraction is exhausted, replicas are placed closer
together than optimal.

The overload factor lets the operator trade off replica
dispersion (durability) against data dispersion (uniform disk usage).

The default overload factor is 0, so device weights are strictly
followed.

With an overload factor of 0.1, each device accepts 10% more
partitions than it otherwise would, but only if it needs to maintain
partition dispersion.

For example, consider a 3-node cluster of machines with equal-size disks;
node A has 12 disks, node B has 12 disks, and node C has
11 disks. The ring has an overload factor of 0.1 (10%).

Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device can accept
up to 10% more partitions for the sake of dispersion. The
missing disk in C means there is one disk's worth of partitions
to spread across the remaining 11 disks, which gives each
disk in C an extra 9.09% load. Since this is less than the 10%
overload, there is one replica of each partition on each node.

However, this does mean that the disks in node C have more data
than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks reach 80% full while A
and B's disks are only 72.7% full.
|
||||||
|
|
||||||
|
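A quick back-of-the-envelope check of the dispersion arithmetic in this
example (the disk counts and overload factor are copied from the scenario
above):

.. code-block:: python

   # Check the 9.09% figure: one missing disk's worth of partitions is
   # spread over node C's remaining 11 disks.
   disks_per_full_node = 12.0   # nodes A and B
   disks_in_node_c = 11         # node C is short one disk
   overload = 0.1               # ring overload factor (10%)

   extra_load = disks_per_full_node / disks_in_node_c - 1
   print('extra load on each C disk: %.2f%%' % (extra_load * 100))  # 9.09%

   # 9.09% is within the 10% overload, so every node can still hold one
   # replica of every partition.
   print('within overload:', extra_load <= overload)                # True
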
Replica counts
~~~~~~~~~~~~~~

To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.

A fractional replica count is for the whole ring and not for individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.

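To see what those percentages mean in absolute terms, here is a rough
sketch; the partition power of 20 is an assumed value, and the builder's
own bookkeeping differs in detail:

.. code-block:: python

   # Rough sketch of the proportions implied by a fractional replica count.
   part_power = 20                      # assumed value for illustration
   partitions = 2 ** part_power         # 1,048,576 partitions
   replicas = 3.2

   fraction_extra = replicas - int(replicas)            # 0.2
   parts_with_four = int(partitions * fraction_extra)   # 209,715
   parts_with_three = partitions - parts_with_four      # 838,861
   print(parts_with_four, parts_with_three)
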
The replica count is adjustable. For example:

.. code-block:: console

   $ swift-ring-builder account.builder set_replicas 4
   $ swift-ring-builder account.builder rebalance

You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.

You can gradually increase the replica count at a rate that does not
adversely affect cluster performance. For example:

.. code-block:: console

   $ swift-ring-builder object.builder set_replicas 3.01
   $ swift-ring-builder object.builder rebalance
   <distribute rings and wait>...

   $ swift-ring-builder object.builder set_replicas 3.02
   $ swift-ring-builder object.builder rebalance
   <distribute rings and wait>...

Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.

Additionally, the :command:`swift-ring-builder X.builder create` command can
now take a decimal argument for the number of replicas.

Partition shift value
~~~~~~~~~~~~~~~~~~~~~

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the ``/account/container/object`` path using Python:

.. code-block:: python

   from hashlib import md5
   from struct import unpack_from

   partition = unpack_from(
       '>I', md5('/account/container/object').digest())[0] >> self._part_shift

For a ring generated with ``part_power`` P, the partition shift value is
``32 - P``.

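As a standalone illustration (the partition power of 20 is an assumed
value, and the hash path prefix and suffix that a real cluster mixes into
the hash are omitted), the same computation without a Ring object looks
like:

.. code-block:: python

   from hashlib import md5
   from struct import unpack_from

   part_power = 20                  # assumed value for illustration
   part_shift = 32 - part_power     # 12

   path = '/account/container/object'
   digest = md5(path.encode('utf-8')).digest()
   partition = unpack_from('>I', digest)[0] >> part_shift
   print(partition)                 # an integer in [0, 2**part_power)
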
Build the ring
~~~~~~~~~~~~~~

The ring builder process includes these high-level steps:

#. The utility calculates the number of partitions to assign to each
   device based on the weight of the device. For example, for a
   partition power of 20, the ring has 1,048,576 partitions. One
   thousand devices of equal weight each want 1,048.576 partitions. The
   devices are sorted by the number of partitions they desire and kept
   in order throughout the initialization process.

   .. note::

      Each device is also assigned a random tiebreaker value that is
      used when two devices desire the same number of partitions. This
      tiebreaker is not stored on disk anywhere, and so two different
      rings created with the same parameters will have different
      partition assignments. For repeatable partition assignments,
      ``RingBuilder.rebalance()`` takes an optional seed value that
      seeds the Python pseudo-random number generator (see the sketch
      at the end of this section).

#. The ring builder assigns each partition replica to the device that
   requires the most partitions at that point while keeping it as far
   away as possible from other replicas. The ring builder prefers to
   assign a replica to a device in a region that does not already have a
   replica. If no such region is available, the ring builder searches for
   a device in a different zone, or on a different server. If it does not
   find one, it looks for a device with no replicas. Finally, if all
   options are exhausted, the ring builder assigns the replica to the
   device that has the fewest replicas already assigned.

   .. note::

      The ring builder assigns multiple replicas to one device only if
      the ring has fewer devices than it has replicas.

#. When building a new ring from an old ring, the ring builder
   recalculates the desired number of partitions that each device wants.

#. The ring builder unassigns partitions and gathers these partitions
   for reassignment, as follows:

   - The ring builder unassigns any assigned partitions from any
     removed devices and adds these partitions to the gathered list.

   - The ring builder unassigns any partition replicas that can be
     spread out for better durability and adds these partitions to the
     gathered list.

   - The ring builder unassigns random partitions from any devices that
     have more partitions than they need and adds these partitions to
     the gathered list.

#. The ring builder reassigns the gathered partitions to devices by
   using a similar method to the one described previously.

#. Whenever a partition replica is reassigned, the ring builder records
   the time of the reassignment. The ring builder uses
   this value when it gathers partitions for reassignment so that no
   partition is moved twice in a configurable amount of time. The
   RingBuilder class knows this configurable amount of time as
   ``min_part_hours``. The ring builder ignores this restriction for
   replicas of partitions on removed devices because removal of a device
   happens on device failure only, and reassignment is the only choice.

These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until near perfect
(less than 1 percent off) or when the balance does not improve by at
least 1 percent (indicating we probably cannot get perfect balance due
to wildly imbalanced zones or too many partitions recently moved).

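For completeness, here is a minimal sketch of driving ``RingBuilder``
directly from Python rather than through the ``swift-ring-builder`` CLI;
the device values are made up, the partition power is kept small so the
example runs quickly, and the ``seed`` argument is the one mentioned in
the note above:

.. code-block:: python

   from swift.common.ring import RingBuilder

   # Illustrative values: 2**10 partitions, 3 replicas, min_part_hours of 1.
   builder = RingBuilder(10, 3, 1)

   for dev_id, (ip, zone) in enumerate([('10.0.0.1', 1),
                                        ('10.0.0.2', 2),
                                        ('10.0.0.3', 3)]):
       builder.add_dev({'id': dev_id, 'region': 1, 'zone': zone,
                        'weight': 100.0, 'ip': ip, 'port': 6200,
                        'device': 'sdb'})

   # Passing a seed makes the random tiebreaker, and therefore the
   # partition assignment, repeatable.
   builder.rebalance(seed=42)
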
@ -0,0 +1,32 @@
===============================================================
Configure project-specific image locations with Object Storage
===============================================================

For some deployers, it is not ideal to store all images in one place to
enable all projects and users to access them. You can configure the Image
service to store image data in project-specific image locations. Then,
only the following projects can use the Image service to access the
created image:

- The project that owns the image
- Projects that are defined in ``swift_store_admin_tenants`` and that
  have admin-level accounts

**To configure project-specific image locations**

#. Configure swift as your ``default_store`` in the
   ``glance-api.conf`` file.

#. Set these configuration options in the ``glance-api.conf`` file:

   - ``swift_store_multi_tenant``
     Set to ``True`` to enable tenant-specific storage locations.
     Default is ``False``.

   - ``swift_store_admin_tenants``
     Specify a list of tenant IDs that can grant read and write access to all
     Object Storage containers that are created by the Image service.

With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.

208
doc/source/admin/objectstorage-troubleshoot.rst
Normal file
@ -0,0 +1,208 @@
===========================
Troubleshoot Object Storage
===========================

For Object Storage, everything is logged in ``/var/log/syslog`` (or
``messages`` on some distros). Several settings enable further
customization of logging, such as ``log_name``, ``log_facility``, and
``log_level``, within the object server configuration files.

Drive failure
~~~~~~~~~~~~~

Problem
-------

Drive failure can prevent Object Storage from performing replication.

Solution
--------

In the event that a drive has failed, the first step is to make sure the
drive is unmounted. This will make it easier for Object Storage to work
around the failure until it has been resolved. If the drive is going to
be replaced immediately, then it is just best to replace the drive,
format it, remount it, and let replication fill it up.

If you cannot replace the drive immediately, then it is best to leave it
unmounted, and remove the drive from the ring. This will allow all the
replicas that were on that drive to be replicated elsewhere until the
drive is replaced. Once the drive is replaced, it can be re-added to the
ring.

You can look at error messages in the ``/var/log/kern.log`` file for
hints of drive failure.

Server failure
~~~~~~~~~~~~~~

Problem
-------

The server is potentially offline: it may have failed or may require a
reboot.

Solution
--------

If a server is having hardware issues, it is a good idea to make sure
the Object Storage services are not running. This will allow Object
Storage to work around the failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Object
Storage work around the failure and get the machine fixed and back
online. When the machine comes back online, replication will make sure
that anything that is missing during the downtime will get updated.

If the server has more serious issues, then it is probably best to
remove all of the server's devices from the ring. Once the server has
been repaired and is back online, the server's devices can be added back
into the ring. It is important that the devices are reformatted before
putting them back into the ring, because each device is likely to be
responsible for a different set of partitions than before.

Detect failed drives
~~~~~~~~~~~~~~~~~~~~

Problem
-------

When drives fail, it can be difficult to detect that a drive has failed
and to determine the details of the failure.

Solution
--------

It has been our experience that when a drive is about to fail, error
messages appear in the ``/var/log/kern.log`` file. There is a script called
``swift-drive-audit`` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Object
Storage can work around it. The script takes a configuration file with
the following settings:

.. list-table:: **Description of configuration options for [drive-audit] in drive-audit.conf**
   :header-rows: 1

   * - Configuration option = Default value
     - Description
   * - ``device_dir = /srv/node``
     - Directory devices are mounted under
   * - ``error_limit = 1``
     - Number of errors to find before a device is unmounted
   * - ``log_address = /dev/log``
     - Location where syslog sends the logs to
   * - ``log_facility = LOG_LOCAL0``
     - Syslog log facility
   * - ``log_file_pattern = /var/log/kern.*[!.][!g][!z]``
     - Location of the log file, with a globbing pattern, that is checked
       to locate device blocks with errors
   * - ``log_level = INFO``
     - Logging level
   * - ``log_max_line_length = 0``
     - Caps the length of log lines to the value given; no limit if set to 0,
       the default.
   * - ``log_to_console = False``
     - No help text available for this option.
   * - ``minutes = 60``
     - Number of minutes to look back in ``/var/log/kern.log``
   * - ``recon_cache_path = /var/cache/swift``
     - Directory where stats for a few items will be stored
   * - ``regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b``
     - No help text available for this option.
   * - ``unmount_failed_device = True``
     - No help text available for this option.

.. warning::

   This script has only been tested on Ubuntu 10.04; use with caution on
   other operating systems in production.

Emergency recovery of ring builder files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Problem
-------

An emergency might prevent a successful backup from restoring the
cluster to operational status.

Solution
--------

You should always keep a backup of swift ring builder files. However, if
an emergency occurs, this procedure may assist in returning your cluster
to an operational state.

Using existing swift tools, there is no way to recover a builder file
from a ``ring.gz`` file. However, if you have knowledge of Python, it
is possible to construct a builder file that is pretty close to the one
you have lost.

.. warning::

   This procedure is a last resort for emergency circumstances. It
   requires knowledge of the swift python code and may not succeed.

#. Load the ring and a new ringbuilder object in a Python REPL:

   .. code-block:: python

      >>> from swift.common.ring import RingData, RingBuilder
      >>> ring = RingData.load('/path/to/account.ring.gz')

#. Start copying the data we have in the ring into the builder:

   .. code-block:: python

      >>> import math
      >>> partitions = len(ring._replica2part2dev_id[0])
      >>> replicas = len(ring._replica2part2dev_id)

      >>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
      >>> builder.devs = ring.devs
      >>> builder._replica2part2dev = ring._replica2part2dev_id
      >>> builder._last_part_moves_epoch = 0
      >>> from array import array
      >>> builder._last_part_moves = array('B', (0 for _ in xrange(partitions)))
      >>> builder._set_parts_wanted()
      >>> for d in builder._iter_devs():
              d['parts'] = 0
      >>> for p2d in builder._replica2part2dev:
              for dev_id in p2d:
                  builder.devs[dev_id]['parts'] += 1

   This is the extent of the recoverable fields.

#. For ``min_part_hours`` you either have to remember what the value you
   used was, or just make up a new one:

   .. code-block:: python

      >>> builder.change_min_part_hours(24)  # or whatever you want it to be

#. Validate the builder. If this raises an exception, check your
   previous code:

   .. code-block:: python

      >>> builder.validate()

#. After it validates, save the builder and create a new ``account.builder``:

   .. code-block:: python

      >>> import pickle
      >>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)
      >>> exit()

#. You should now have a file called ``account.builder`` in the current
   working directory. Run
   :command:`swift-ring-builder account.builder write_ring` and compare the new
   ``account.ring.gz`` to the ``account.ring.gz`` that you started
   from. They probably are not byte-for-byte identical, but if you load them
   in a REPL and their ``_replica2part2dev_id`` and ``devs`` attributes are
   the same (or nearly so), then you are in good shape; a comparison sketch
   follows this list.

#. Repeat the procedure for ``container.ring.gz`` and
   ``object.ring.gz``, and you might get usable builder files.

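The comparison mentioned above can be done with a few lines of Python; the
paths below are illustrative and assume the rebuilt ``account.ring.gz``
was written to the current working directory:

.. code-block:: python

   # Compare the original ring with the one written from the recovered builder.
   from swift.common.ring import RingData

   old_ring = RingData.load('/path/to/account.ring.gz')  # the ring you started from
   new_ring = RingData.load('account.ring.gz')           # written from account.builder

   print(old_ring.devs == new_ring.devs)
   print(old_ring._replica2part2dev_id == new_ring._replica2part2dev_id)
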
@ -93,6 +93,7 @@ Administrator Documentation
    replication_network
    logs
    ops_runbook/index
    admin/index

Object Storage v1 REST API Documentation
========================================