docs migration from openstack-manuals
Context for this is at
https://specs.openstack.org/openstack/docs-specs/specs/pike/os-manuals-migration.html

Change-Id: I9a4da27ce1d56b6406e2db979698038488f3cf6f
BIN  doc/source/admin/figures/objectstorage-accountscontainers.png  (new file, 32 KiB)
BIN  doc/source/admin/figures/objectstorage-arch.png                (new file, 56 KiB)
BIN  doc/source/admin/figures/objectstorage-buildingblocks.png      (new file, 48 KiB)
BIN  doc/source/admin/figures/objectstorage-nodes.png               (new file, 58 KiB)
BIN  doc/source/admin/figures/objectstorage-partitions.png          (new file, 28 KiB)
BIN  doc/source/admin/figures/objectstorage-replication.png         (new file, 45 KiB)
BIN  doc/source/admin/figures/objectstorage-ring.png                (new file, 23 KiB)
BIN  doc/source/admin/figures/objectstorage-usecase.png             (new file, 61 KiB)
BIN  doc/source/admin/figures/objectstorage-zones.png               (new file, 10 KiB)
BIN  doc/source/admin/figures/objectstorage.png                     (new file, 23 KiB)
doc/source/admin/index.rst (new file)
@@ -0,0 +1,22 @@
===================================
OpenStack Swift Administrator Guide
===================================

.. toctree::
   :maxdepth: 2

   objectstorage-intro.rst
   objectstorage-features.rst
   objectstorage-characteristics.rst
   objectstorage-components.rst
   objectstorage-ringbuilder.rst
   objectstorage-arch.rst
   objectstorage-replication.rst
   objectstorage-large-objects.rst
   objectstorage-auditors.rst
   objectstorage-EC.rst
   objectstorage-account-reaper.rst
   objectstorage-tenant-specific-image-storage.rst
   objectstorage-monitoring.rst
   objectstorage-admin.rst
   objectstorage-troubleshoot.rst
doc/source/admin/objectstorage-EC.rst (new file)
@@ -0,0 +1,31 @@
==============
Erasure coding
==============

Erasure coding is a set of algorithms that allows the reconstruction of
missing data from a set of original data. In theory, erasure coding uses
less capacity with similar durability characteristics to replicas.
From an application perspective, erasure coding support is transparent.
Object Storage (swift) implements erasure coding as a Storage Policy.
See `Storage Policies
<https://docs.openstack.org/developer/swift/overview_policies.html>`_
for more details.

There is no external API related to erasure coding. Create a container using
a Storage Policy; the interaction with the cluster is the same as with any
other durability policy. Because the erasure coding support is implemented as
a Storage Policy, you can isolate all of the storage devices that are
associated with your cluster's erasure coding capability. It is entirely
possible to share devices between storage policies, but for erasure coding it
may make more sense to use not only separate devices but possibly even entire
nodes dedicated to erasure coding.
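
As an illustration, a container can be associated with an erasure-coded
policy at creation time by passing the ``X-Storage-Policy`` header on the
container ``PUT``. The token, storage URL, and the policy name ``ec42`` below
are placeholders for values defined in your deployment:

.. code-block:: console

   $ curl -i -X PUT \
       -H "X-Auth-Token: $TOKEN" \
       -H "X-Storage-Policy: ec42" \
       $STORAGE_URL/ec-container

After that, objects are uploaded to and downloaded from the container exactly
as with a replicated policy.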

.. important::

   The erasure code support in Object Storage is considered beta in Kilo.
   Most major functionality is included, but it has not been tested or
   validated at large scale. This feature relies on ``ssync`` for durability.
   We recommend deployers do extensive testing and not deploy production
   data using an erasure code storage policy.
   If any bugs are found during testing, please report them to
   https://bugs.launchpad.net/swift
doc/source/admin/objectstorage-account-reaper.rst (new file)
@@ -0,0 +1,51 @@
==============
Account reaper
==============

The purpose of the account reaper is to remove data from deleted accounts.

A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.
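
For example, with reseller admin credentials, the ``DELETE`` request is a
single call against the account's storage URL; the token and account name
below are placeholders:

.. code-block:: console

   $ curl -i -X DELETE \
       -H "X-Auth-Token: $RESELLER_ADMIN_TOKEN" \
       https://swift.example.com/v1/AUTH_<account>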

Typically, a specific retention time or undelete feature is not provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file to
delay the actual deletion of data. At this time, to undelete you have to update
the account database replicas directly, setting the status column to an
empty string and updating the put_timestamp to be greater than the
delete_timestamp.

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and scans the server
occasionally for account databases marked for deletion. It only fires up
on the accounts for which the server is the primary node, so that
multiple account servers aren't trying to do it simultaneously. Using
multiple servers to delete one account might improve the deletion speed
but requires coordination to avoid duplication. Speed really is not a
big concern with data deletion, and large accounts aren't deleted often.

Deleting an account is simple. For each account container, all objects
are deleted and then the container is deleted. Deletion requests that
fail will not stop the overall process but will cause the overall
process to fail eventually (for example, if an object delete times out,
you will not be able to delete the container or the account). The
account reaper keeps trying to delete an account until it is empty, at
which point the database reclaim process within the db\_replicator will
remove the database files.

A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for example:

.. code-block:: console

   Account <name> has not been reaped since <date>

You can control when this is logged with the ``reap_warn_after`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file.
The default value is 30 days.
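
As a sketch of the two options mentioned above, the following
``account-server.conf`` fragment delays reaping by one day and keeps the
default warning interval; the values are examples, not recommendations:

.. code-block:: ini

   [account-reaper]
   # Wait one day after an account is marked DELETED before reaping it.
   delay_reaping = 86400
   # Log a warning if an account still has not been reaped after 30 days.
   reap_warn_after = 2592000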
doc/source/admin/objectstorage-admin.rst (new file)
@@ -0,0 +1,11 @@
========================================
System administration for Object Storage
========================================

By understanding Object Storage concepts, you can better monitor and
administer your storage solution. The majority of the administration
information is maintained in the developer documentation at
`docs.openstack.org/developer/swift/ <https://docs.openstack.org/developer/swift/>`__.

See the `OpenStack Configuration Reference
<https://docs.openstack.org/ocata/config-reference/object-storage.html>`__
for a list of configuration options for Object Storage.
doc/source/admin/objectstorage-arch.rst (new file)
@@ -0,0 +1,88 @@
====================
Cluster architecture
====================

Access tier
~~~~~~~~~~~

Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

.. note::

   If you want to use OpenStack Identity API v3 for authentication, you
   have the following options available in ``/etc/swift/dispersion.conf``:
   ``auth_version``, ``user_domain_name``, ``project_domain_name``,
   and ``project_name``.
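
A minimal sketch of such a ``dispersion.conf`` using Identity API v3; the
endpoint URL, credentials, and domain names are placeholders for your
deployment:

.. code-block:: ini

   [dispersion]
   auth_url = https://controller:5000/v3
   auth_user = dispersion
   auth_key = DISPERSION_PASSWORD
   auth_version = 3
   project_name = service
   project_domain_name = Default
   user_domain_name = Default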

**Object Storage architecture**

.. figure:: figures/objectstorage-arch.png

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.

Since this is an HTTP addressable storage service, you may incorporate a
load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces - one for the
incoming ``front-end`` requests and the other for the ``back-end`` access to
the object storage nodes to put and fetch data.

Factors to consider
-------------------

For most publicly facing deployments as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.

Storage nodes
~~~~~~~~~~~~~

In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision storage nodes with single gigabit or
10 gigabit network interfaces depending on the expected workload and
desired performance.

**Object Storage (swift)**

.. figure:: figures/objectstorage-nodes.png

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.

Factors to consider
-------------------

You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.

To achieve higher apparent throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.
doc/source/admin/objectstorage-auditors.rst (new file)
@@ -0,0 +1,30 @@
==============
Object Auditor
==============

On system failures, the XFS file system can sometimes truncate files it is
trying to write and produce zero-byte files. The object-auditor will catch
these problems, but in the case of a system crash it is advisable to run
an extra, less rate limited sweep to check for these specific files.
You can run this command as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

.. note::

   ``-z`` means to only check for zero-byte files, at 1000 files per second.

It is useful to run the object auditor on a specific device or set of devices.
You can run the object auditor once as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once \
     --devices=sda,sdb

.. note::

   This will run the object auditor on only the ``sda`` and ``sdb`` devices.
   This parameter accepts a comma-separated list of values.
doc/source/admin/objectstorage-characteristics.rst (new file)
@@ -0,0 +1,43 @@
==============================
Object Storage characteristics
==============================

The key characteristics of Object Storage are:

- All objects stored in Object Storage have a URL.

- All objects stored are replicated 3✕ in as-unique-as-possible zones,
  which can be defined as a group of drives, a node, a rack, and so on.

- All objects have their own metadata.

- Developers interact with the object storage system through a RESTful
  HTTP API.

- Object data can be located anywhere in the cluster.

- The cluster scales by adding additional nodes without sacrificing
  performance, which allows a more cost-effective linear storage
  expansion than fork-lift upgrades.

- Data does not have to be migrated to an entirely new storage system.

- New nodes can be added to the cluster without downtime.

- Failed nodes and disks can be swapped out without downtime.

- It runs on industry-standard hardware, such as Dell, HP, and
  Supermicro.

.. _objectstorage-figure:

Object Storage (swift)

.. figure:: figures/objectstorage.png

Developers can either write directly to the Swift API or use one of the
many client libraries that exist for all of the popular programming
languages, such as Java, Python, Ruby, and C#. Amazon S3 and RackSpace
Cloud Files users should be very familiar with Object Storage. Users new
to object storage systems will have to adjust to a different approach
and mindset than those required for a traditional filesystem.
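
For example, once a client has obtained an auth token and storage URL (both
placeholders below), storing and retrieving an object is a pair of plain
HTTP requests:

.. code-block:: console

   $ curl -i -X PUT -H "X-Auth-Token: $TOKEN" \
       -T photo.jpg $STORAGE_URL/my-container/photo.jpg

   $ curl -H "X-Auth-Token: $TOKEN" \
       -o photo.jpg $STORAGE_URL/my-container/photo.jpg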
doc/source/admin/objectstorage-components.rst (new file)
@@ -0,0 +1,258 @@
==========
Components
==========

Object Storage uses the following components to deliver high
availability, high durability, and high concurrency:

- **Proxy servers** - Handle all of the incoming API requests.

- **Rings** - Map logical names of data to locations on particular
  disks.

- **Zones** - Isolate data from other zones. A failure in one zone
  does not impact the rest of the cluster as data replicates
  across zones.

- **Accounts and containers** - Each account and container is an
  individual database that is distributed across the cluster. An
  account database contains the list of containers in that account. A
  container database contains the list of objects in that container.

- **Objects** - The data itself.

- **Partitions** - A partition stores objects, account databases, and
  container databases and helps manage locations where data lives in
  the cluster.

.. _objectstorage-building-blocks-figure:

**Object Storage building blocks**

.. figure:: figures/objectstorage-buildingblocks.png

Proxy servers
-------------

Proxy servers are the public face of Object Storage and handle all of
the incoming API requests. Once a proxy server receives a request, it
determines the storage node based on the object's URL, for example:
https://swift.example.com/v1/account/container/object. Proxy servers
also coordinate responses, handle failures, and coordinate timestamps.

Proxy servers use a shared-nothing architecture and can be scaled as
needed based on projected workloads. A minimum of two proxy servers
should be deployed for redundancy. If one proxy server fails, the others
take over.

For more information concerning proxy server configuration, see
`Configuration Reference
<https://docs.openstack.org/ocata/config-reference/object-storage/proxy-server.html>`_.

Rings
-----

A ring represents a mapping between the names of entities stored on disks
and their physical locations. There are separate rings for accounts,
containers, and objects. When other components need to perform any
operation on an object, container, or account, they need to interact
with the appropriate ring to determine their location in the cluster.

The ring maintains this mapping using zones, devices, partitions, and
replicas. Each partition in the ring is replicated, by default, three
times across the cluster, and partition locations are stored in the
mapping maintained by the ring. The ring is also responsible for
determining which devices are used for handoff in failure scenarios.

Data can be isolated into zones in the ring. Each partition replica is
guaranteed to reside in a different zone. A zone could represent a
drive, a server, a cabinet, a switch, or even a data center.

The partitions of the ring are equally divided among all of the devices
in the Object Storage installation. When partitions need to be moved
around (for example, if a device is added to the cluster), the ring
ensures that a minimum number of partitions are moved at a time, and
only one replica of a partition is moved at a time.

You can use weights to balance the distribution of partitions on drives
across the cluster. This can be useful, for example, when differently
sized drives are used in a cluster.

The ring is used by the proxy server and several background processes
(like replication).
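
To see a ring lookup in practice, the ``swift-get-nodes`` tool prints the
partition and the primary and handoff devices for a given path; the account,
container, and object names below are placeholders:

.. code-block:: console

   $ swift-get-nodes /etc/swift/object.ring.gz AUTH_test my-container my-object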

.. _objectstorage-ring-figure:

**The ring**

.. figure:: figures/objectstorage-ring.png

These rings are externally managed. The server processes themselves
do not modify the rings; instead, they are given new rings modified by
other tools.

The ring uses a configurable number of bits from an ``MD5`` hash for a path
as a partition index that designates a device. The number of bits kept
from the hash is known as the partition power, and 2 to the partition
power indicates the partition count. Partitioning the full ``MD5`` hash ring
allows other parts of the cluster to work in batches of items at once,
which ends up either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.

Another configurable value is the replica count, which indicates how
many of the partition-device assignments make up a single ring. For a
given partition number, each replica's device will not be in the same
zone as any other replica's device. Zones can be used to group devices
based on physical locations, power separations, network separations, or
any other attribute that would improve the availability of multiple
replicas at the same time.

Zones
-----

Object Storage allows configuring zones in order to isolate failure
boundaries. If possible, each data replica resides in a separate zone.
At the smallest level, a zone could be a single drive or a grouping of a
few drives. If there were five object storage servers, then each server
would represent its own zone. Larger deployments would have an entire
rack (or multiple racks) of object servers, each representing a zone.
The goal of zones is to allow the cluster to tolerate significant
outages of storage servers without losing all replicas of the data.

.. _objectstorage-zones-figure:

**Zones**

.. figure:: figures/objectstorage-zones.png

Accounts and containers
-----------------------

Each account and container is an individual SQLite database that is
distributed across the cluster. An account database contains the list of
containers in that account. A container database contains the list of
objects in that container.

.. _objectstorage-accountscontainers-figure:

**Accounts and containers**

.. figure:: figures/objectstorage-accountscontainers.png

To keep track of object data locations, each account in the system has a
database that references all of its containers, and each container
database references each object.

Partitions
----------

A partition is a collection of stored data. This includes account databases,
container databases, and objects. Partitions are core to the replication
system.

Think of a partition as a bin moving throughout a fulfillment center
warehouse. Individual orders get thrown into the bin. The system treats
that bin as a cohesive entity as it moves throughout the system. A bin
is easier to deal with than many little things. It makes for fewer
moving parts throughout the system.

System replicators and object uploads/downloads operate on partitions.
As the system scales up, its behavior continues to be predictable
because the number of partitions is a fixed number.

Implementing a partition is conceptually simple: a partition is just a
directory sitting on a disk with a corresponding hash table of what it
contains.

.. _objectstorage-partitions-figure:

**Partitions**

.. figure:: figures/objectstorage-partitions.png

Replicators
-----------

In order to ensure that there are three copies of the data everywhere,
replicators continuously examine each partition. For each local
partition, the replicator compares it against the replicated copies in
the other zones to see if there are any differences.

The replicator knows if replication needs to take place by examining
hashes. A hash file is created for each partition, which contains hashes
of each directory in the partition. For a given partition, the hash
files for each of the partition's copies are compared. If the hashes are
different, then it is time to replicate, and the directory that needs to
be replicated is copied over.

This is where partitions come in handy. With fewer things in the system,
larger chunks of data are transferred around (rather than lots of little
TCP connections, which is inefficient) and there is a consistent number
of hashes to compare.

The cluster eventually has a consistent behavior where the newest data
has a priority.

.. _objectstorage-replication-figure:

**Replication**

.. figure:: figures/objectstorage-replication.png

If a zone goes down, one of the nodes containing a replica notices and
proactively copies data to a handoff location.

Use cases
---------

The following sections show use cases for object uploads and downloads
and introduce the components.

Upload
~~~~~~

A client uses the REST API to make an HTTP request to PUT an object into
an existing container. The cluster receives the request. First, the
system must figure out where the data is going to go. To do this, the
account name, container name, and object name are all used to determine
the partition where this object should live.

Then a lookup in the ring figures out which storage nodes contain the
partitions in question.

The data is then sent to each storage node where it is placed in the
appropriate partition. At least two of the three writes must be
successful before the client is notified that the upload was successful.

Next, the container database is updated asynchronously to reflect that
there is a new object in it.

.. _objectstorage-usecase-figure:

**Object Storage in use**

.. figure:: figures/objectstorage-usecase.png

Download
~~~~~~~~

A request comes in for an account/container/object. Using the same
consistent hashing, the partition name is generated. A lookup in the
ring reveals which storage nodes contain that partition. A request is
made to one of the storage nodes to fetch the object and, if that fails,
requests are made to the other nodes.
doc/source/admin/objectstorage-features.rst (new file)
@@ -0,0 +1,63 @@
=====================
Features and benefits
=====================

.. list-table::
   :header-rows: 1
   :widths: 10 40

   * - Features
     - Benefits
   * - Leverages commodity hardware
     - No lock-in, lower price/GB.
   * - HDD/node failure agnostic
     - Self-healing, reliable, data redundancy protects from failures.
   * - Unlimited storage
     - Large and flat namespace, highly scalable read/write access,
       able to serve content directly from storage system.
   * - Multi-dimensional scalability
     - Scale-out architecture: Scale vertically and
       horizontally-distributed storage. Backs up and archives large
       amounts of data with linear performance.
   * - Account/container/object structure
     - No nesting, not a traditional file system: Optimized for scale,
       it scales to multiple petabytes and billions of objects.
   * - Built-in replication 3✕ + data redundancy (compared with 2✕ on
       RAID)
     - A configurable number of accounts, containers and object copies
       for high availability.
   * - Easily add capacity (unlike RAID resize)
     - Elastic data scaling with ease.
   * - No central database
     - Higher performance, no bottlenecks.
   * - RAID not required
     - Handle many small, random reads and writes efficiently.
   * - Built-in management utilities
     - Account management: Create, add, verify, and delete users;
       Container management: Upload, download, and verify; Monitoring:
       Capacity, host, network, log trawling, and cluster health.
   * - Drive auditing
     - Detect drive failures preempting data corruption.
   * - Expiring objects
     - Users can set an expiration time or a TTL on an object to
       control access (see the example after this table).
   * - Direct object access
     - Enable direct browser access to content, such as for a control
       panel.
   * - Realtime visibility into client requests
     - Know what users are requesting.
   * - Supports S3 API
     - Utilize tools that were designed for the popular S3 API.
   * - Restrict containers per account
     - Limit access to control usage by user.
   * - Support for NetApp, Nexenta, SolidFire
     - Unified support for block volumes using a variety of storage
       systems.
   * - Snapshot and backup API for block volumes
     - Data protection and recovery for VM data.
   * - Standalone volume API available
     - Separate endpoint and API for integration with other compute
       systems.
   * - Integration with Compute
     - Fully integrated with Compute for attaching block volumes and
       reporting on usage.
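
As an illustration of the expiring-objects feature listed above, an
expiration can be set from the command line; the example assumes an
authenticated python-swiftclient session, and the container and object names
are placeholders:

.. code-block:: console

   $ swift post -H "X-Delete-After: 86400" my-container my-object

The object is scheduled for deletion 86400 seconds (one day) later;
``X-Delete-At`` accepts an absolute UNIX timestamp instead.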
doc/source/admin/objectstorage-intro.rst (new file)
@@ -0,0 +1,23 @@
==============================
Introduction to Object Storage
==============================

OpenStack Object Storage (swift) is used for redundant, scalable data
storage using clusters of standardized servers to store petabytes of
accessible data. It is a long-term storage system for large amounts of
static data which can be retrieved and updated. Object Storage uses a
distributed architecture with no central point of control, providing
greater scalability, redundancy, and permanence. Objects are written to
multiple hardware devices, with the OpenStack software responsible for
ensuring data replication and integrity across the cluster. Storage
clusters scale horizontally by adding new nodes. Should a node fail,
OpenStack works to replicate its content from other active nodes.
Because OpenStack uses software logic to ensure data replication and
distribution across different devices, inexpensive commodity hard drives
and servers can be used in lieu of more expensive equipment.

Object Storage is ideal for cost effective, scale-out storage. It
provides a fully distributed, API-accessible storage platform that can
be integrated directly into applications or used for backup, archiving,
and data retention.
doc/source/admin/objectstorage-large-objects.rst (new file)
@@ -0,0 +1,35 @@
====================
Large object support
====================

Object Storage (swift) uses segmentation to support the upload of large
objects. By default, Object Storage limits the size of a single uploaded
object to 5 GB. With segmentation, the total size of a single stored
object is virtually unlimited. The segmentation process works by splitting
the object into segments and creating a manifest file that ties the
segments together as a single object. This option offers greater upload
speed, with the possibility of parallel uploads.

Large objects
~~~~~~~~~~~~~

The large object is composed of two types of objects:

- **Segment objects** store the object content. You can divide your
  content into segments, and upload each segment into its own segment
  object. Segment objects do not have any special features. You create,
  update, download, and delete segment objects just as you would normal
  objects.

- A **manifest object** links the segment objects into one logical
  large object. When you download a manifest object, Object Storage
  concatenates and returns the contents of the segment objects in the
  response body of the request. The manifest object types are:

  - **Static large objects**
  - **Dynamic large objects**

To find out more information on large object support, see `Large objects
<https://docs.openstack.org/user-guide/cli-swift-large-object-creation.html>`_
in the OpenStack End User Guide, or `Large Object Support
<https://docs.openstack.org/developer/swift/overview_large_objects.html>`_
in the developer documentation.
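
As an illustration, the python-swiftclient command-line tool can perform the
segmentation automatically; the container name, file name, and segment size
below are example values:

.. code-block:: console

   $ swift upload my-container my-large-file --segment-size 1073741824

This uploads ``my-large-file`` in 1 GiB segments and creates a manifest
object so that the content can later be retrieved as a single object.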
doc/source/admin/objectstorage-monitoring.rst (new file)
@@ -0,0 +1,228 @@
=========================
Object Storage monitoring
=========================

.. note::

   This section was excerpted from a blog post by `Darrell
   Bishop <http://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd>`_ and
   has since been edited.

An OpenStack Object Storage cluster is a collection of many daemons that
work together across many nodes. With so many different components, you
must be able to tell what is going on inside the cluster. Tracking
server-level meters like CPU utilization, load, memory consumption, disk
usage and utilization, and so on is necessary, but not sufficient.

Swift Recon
~~~~~~~~~~~

The Swift Recon middleware (see
`Cluster Telemetry and Monitoring <https://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring>`_)
provides general machine statistics, such as load average, socket
statistics, ``/proc/meminfo`` contents, as well as Swift-specific meters:

- The ``MD5`` sum of each ring file.

- The most recent object replication time.

- Count of each type of quarantined file: account, container, or
  object.

- Count of "async_pendings" (deferred container updates) on disk.

Swift Recon is middleware that is installed in the object servers
pipeline and takes one required option: a local cache directory. To
track ``async_pendings``, you must set up an additional cron job for
each object server. You access data by either sending HTTP requests
directly to the object server or using the ``swift-recon`` command-line
client.
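
For example, the following illustrative invocations query the object tier
for ring checksums, replication times, quarantine counts, and async
pendings; the exact set of available reports depends on your release:

.. code-block:: console

   $ swift-recon object --md5
   $ swift-recon object -r -q -a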

There are Object Storage cluster statistics but the typical
server meters overlap with existing server monitoring systems. To get
the Swift-specific meters into a monitoring system, they must be polled.
Swift Recon acts as a middleware meters collector. The
process that feeds meters to your statistics system, such as
``collectd`` and ``gmond``, should already run on the storage node.
You can choose to either talk to Swift Recon or collect the meters
directly.

Swift-Informant
~~~~~~~~~~~~~~~

The Swift-Informant middleware (see
`swift-informant <https://github.com/pandemicsyn/swift-informant>`_) provides
real-time visibility into Object Storage client requests. It sits in the
pipeline for the proxy server, and after each request to the proxy server it
sends three meters to a ``StatsD`` server:

- A counter increment for a meter like ``obj.GET.200`` or
  ``cont.PUT.404``.

- Timing data for a meter like ``acct.GET.200`` or ``obj.GET.200``.
  [The README says the meters look like ``duration.acct.GET.200``, but
  I do not see the ``duration`` in the code. I am not sure what the
  Etsy server does but our StatsD server turns timing meters into five
  derivative meters with new segments appended, so it probably works as
  coded. The first meter turns into ``acct.GET.200.lower``,
  ``acct.GET.200.upper``, ``acct.GET.200.mean``,
  ``acct.GET.200.upper_90``, and ``acct.GET.200.count``].

- A counter increase by the bytes transferred for a meter like
  ``tfer.obj.PUT.201``.

The timing meters are useful for gauging the quality of service that
clients experience, while the counters give a sense of the volume of the
various permutations of request server type, command, and response
code. Swift-Informant requires no change to core Object
Storage code because it is implemented as middleware. However, it gives
no insight into the workings of the cluster past the proxy server.
If the responsiveness of one storage node degrades, you can only see
that some of the requests are bad, either as high latency or error
status codes.

Statsdlog
~~~~~~~~~

The `Statsdlog <https://github.com/pandemicsyn/statsdlog>`_
project increments StatsD counters based on logged events. Like
Swift-Informant, it is also non-intrusive; however, statsdlog can track
events from all Object Storage daemons, not just the proxy server. The
daemon listens to a UDP stream of syslog messages, and StatsD counters
are incremented when a log line matches a regular expression. Meter
names are mapped to regex match patterns in a JSON file, allowing
flexible configuration of what meters are extracted from the log stream.

Currently, only the first matching regex triggers a StatsD counter
increment, and the counter is always incremented by one. There is no way
to increment a counter by more than one or send timing data to StatsD
based on the log line content. The tool could be extended to handle more
meters for each line and data extraction, including timing data. But a
coupling would still exist between the log textual format and the log
parsing regexes, which would themselves be more complex to support
multiple matches for each line and data extraction. Also, log processing
introduces a delay between the triggering event and sending the data to
StatsD. It would be preferable to increment error counters where they
occur and send timing data as soon as it is known to avoid coupling
between a log string and a parsing regex and prevent a time delay
between events and sending data to StatsD.

The next section describes another method for gathering Object Storage
operational meters.

Swift StatsD logging
~~~~~~~~~~~~~~~~~~~~

StatsD (see `Measure Anything, Measure Everything
<http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/>`_)
was designed for application code to be deeply instrumented. Meters are
sent in real-time by the code that just noticed or did something. The
overhead of sending a meter is extremely low: a ``sendto`` of one UDP
packet. If that overhead is still too high, the StatsD client library
can send only a random portion of samples and StatsD approximates the
actual number when flushing meters upstream.

To avoid the problems inherent with middleware-based monitoring and
after-the-fact log processing, the sending of StatsD meters is
integrated into Object Storage itself. The submitted change set (see
`<https://review.openstack.org/#change,6058>`_) currently reports 124 meters
across 15 Object Storage daemons and the tempauth middleware. Details of
the meters tracked are in the `Administrator's
Guide <https://docs.openstack.org/developer/swift/admin_guide.html>`_.

The sending of meters is integrated with the logging framework. To
enable it, configure ``log_statsd_host`` in the relevant config file. You
can also specify the port and a default sample rate. The specified
default sample rate is used unless a specific call to a statsd logging
method (see the list below) overrides it. Currently, no logging calls
override the sample rate, but it is conceivable that some meters may
require accuracy (``sample_rate=1``) while others may not.

.. code-block:: ini

   [DEFAULT]
   # ...
   log_statsd_host = 127.0.0.1
   log_statsd_port = 8125
   log_statsd_default_sample_rate = 1

Then the LogAdapter object returned by ``get_logger()``, usually stored
in ``self.logger``, has these new methods:

- ``set_statsd_prefix(self, prefix)`` Sets the client library stat
  prefix value which gets prefixed to every meter. The default prefix
  is the ``name`` of the logger such as ``object-server``,
  ``container-auditor``, and so on. This is currently used to turn
  ``proxy-server`` into one of ``proxy-server.Account``,
  ``proxy-server.Container``, or ``proxy-server.Object`` as soon as the
  Controller object is determined and instantiated for the request.

- ``update_stats(self, metric, amount, sample_rate=1)`` Increments
  the supplied meter by the given amount. This is used when you need
  to add or subtract more than one from a counter, like incrementing
  ``suffix.hashes`` by the number of computed hashes in the object
  replicator.

- ``increment(self, metric, sample_rate=1)`` Increments the given counter
  meter by one.

- ``decrement(self, metric, sample_rate=1)`` Lowers the given counter
  meter by one.

- ``timing(self, metric, timing_ms, sample_rate=1)`` Records that the
  given meter took the supplied number of milliseconds.

- ``timing_since(self, metric, orig_time, sample_rate=1)``
  Convenience method to record a timing meter whose value is "now"
  minus an existing timestamp.

.. note::

   These logging methods may safely be called anywhere you have a
   logger object. If StatsD logging has not been configured, the methods
   are no-ops. This avoids messy conditional logic each place a meter is
   recorded. These example usages show the new logging methods:

.. code-block:: python

   # swift/obj/replicator.py
   def update(self, job):
       # ...
       begin = time.time()
       try:
           hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'],
               do_listdir=(self.replication_count % 10) == 0,
               reclaim_age=self.reclaim_age)
           # See tpooled_get_hashes "Hack".
           if isinstance(hashed, BaseException):
               raise hashed
           self.suffix_hash += hashed
           self.logger.update_stats('suffix.hashes', hashed)
           # ...
       finally:
           self.partition_times.append(time.time() - begin)
           self.logger.timing_since('partition.update.timing', begin)

.. code-block:: python

   # swift/container/updater.py
   def process_container(self, dbfile):
       # ...
       start_time = time.time()
       # ...
           for event in events:
               if 200 <= event.wait() < 300:
                   successes += 1
               else:
                   failures += 1
           if successes > failures:
               self.logger.increment('successes')
               # ...
           else:
               self.logger.increment('failures')
               # ...
           # Only track timing data for attempted updates:
           self.logger.timing_since('timing', start_time)
       else:
           self.logger.increment('no_changes')
           self.no_changes += 1
doc/source/admin/objectstorage-replication.rst (new file)
@@ -0,0 +1,98 @@
===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.

Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the ``consistency
window``. This window defines the duration of the replication and how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.

If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

.. note::

   The replicator does not maintain desired levels of replication when
   failures such as entire node failures occur; most failures are
   transient.
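
To illustrate the ``get_more_nodes`` interface mentioned above, the
following sketch uses the ring API to list the handoff candidates for a
path; the account, container, and object names are placeholders:

.. code-block:: python

   from swift.common.ring import Ring

   # Loads /etc/swift/object.ring.gz
   ring = Ring('/etc/swift', ring_name='object')

   # Primary nodes for this object's partition
   part, primary_nodes = ring.get_nodes('AUTH_test', 'my-container', 'my-object')

   # Handoff nodes, in the order a replicator would try them
   handoff_nodes = list(ring.get_more_nodes(part))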

The main replication types are:

- Database replication
  Replicates containers and objects.

- Object replication
  Replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~

Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.

If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~

The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.

The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.

The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.
doc/source/admin/objectstorage-ringbuilder.rst (new file)
@@ -0,0 +1,228 @@
============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.
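
A minimal, illustrative workflow with swift-ring-builder; the partition
power, replica count, ``min_part_hours``, IP address, port, device name, and
weight are example values for your deployment:

.. code-block:: console

   $ swift-ring-builder object.builder create 20 3 1
   $ swift-ring-builder object.builder add r1z1-10.0.0.51:6200/sda1 100
   $ swift-ring-builder object.builder rebalance

``create 20 3 1`` builds a ring with a partition power of 20, 3 replicas,
and a 1-hour ``min_part_hours``; ``rebalance`` writes out ``object.ring.gz``
for distribution to the servers.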

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.

Ring data structure
~~~~~~~~~~~~~~~~~~~

The ring data structure consists of three top level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition to device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like:

.. code-block:: python

   devices = [self.devs[part2dev_id[partition]] for
              part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

``array('H')`` is used for memory conservation as there may be millions
of partitions.

Overload
~~~~~~~~

The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it cannot do both, the overload
factor determines what happens. Each device takes an extra
fraction of its desired partitions to allow for replica dispersion;
after that extra fraction is exhausted, replicas are placed closer
together than optimal.

The overload factor lets the operator trade off replica
dispersion (durability) against data dispersion (uniform disk usage).

The default overload factor is 0, so device weights are strictly
followed.

With an overload factor of 0.1, each device accepts 10% more
partitions than it otherwise would, but only if it needs to maintain
partition dispersion.

For example, consider a 3-node cluster of machines with equal-size disks;
node A has 12 disks, node B has 12 disks, and node C has
11 disks. The ring has an overload factor of 0.1 (10%).

Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device can accept
up to 10% more partitions for the sake of dispersion. The
missing disk in C means there is one disk's worth of partitions
to spread across the remaining 11 disks, which gives each
disk in C an extra 9.09% load. Since this is less than the 10%
overload, there is one replica of each partition on each node.

However, this does mean that the disks in node C have more data
than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks reach 80% full while A
and B's disks are only 72.7% full.
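
The overload factor described above is set on the builder file and takes
effect at the next rebalance; the builder name and value are examples:

.. code-block:: console

   $ swift-ring-builder object.builder set_overload 0.1
   $ swift-ring-builder object.builder rebalance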
|
||||
|
||||
|
||||
Replica counts
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
To support the gradual change in replica counts, a ring can have a real
|
||||
number of replicas and is not restricted to an integer number of
|
||||
replicas.
|
||||
|
||||
A fractional replica count is for the whole ring and not for individual
|
||||
partitions. It indicates the average number of replicas for each
|
||||
partition. For example, a replica count of 3.2 means that 20 percent of
|
||||
partitions have four replicas and 80 percent have three replicas.
|
||||
|
||||
The replica count is adjustable. For example:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ swift-ring-builder account.builder set_replicas 4
|
||||
$ swift-ring-builder account.builder rebalance
|
||||
|
||||
You must rebalance the replica ring in globally distributed clusters.
|
||||
Operators of these clusters generally want an equal number of replicas
|
||||
and regions. Therefore, when an operator adds or removes a region, the
|
||||
operator adds or removes a replica. Removing unneeded replicas saves on
|
||||
the cost of disks.
|
||||
|
||||
You can gradually increase the replica count at a rate that does not
|
||||
adversely affect cluster performance. For example:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ swift-ring-builder object.builder set_replicas 3.01
|
||||
$ swift-ring-builder object.builder rebalance
|
||||
<distribute rings and wait>...
|
||||
|
||||
$ swift-ring-builder object.builder set_replicas 3.02
|
||||
$ swift-ring-builder object.builder rebalance
|
||||
<distribute rings and wait>...
|
||||
|
||||
Changes take effect after the ring is rebalanced. Therefore, if you
|
||||
intend to change from 3 replicas to 3.01 but you accidentally type
|
||||
2.01, no data is lost.
|
||||
|
||||
Additionally, the :command:`swift-ring-builder X.builder create` command can
|
||||
now take a decimal argument for the number of replicas.
|
||||
|
||||
Partition shift value
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The partition shift value is known internally to the Ring class as
|
||||
``_part_shift``. This value is used to shift an MD5 hash to calculate
|
||||
the partition where the data for that hash should reside. Only the top
|
||||
four bytes of the hash is used in this process. For example, to compute
|
||||
the partition for the ``/account/container/object`` path using Python:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
partition = unpack_from('>I',
|
||||
md5('/account/container/object').digest())[0] >>
|
||||
self._part_shift
|
||||
|
||||
For a ring generated with part\_power P, the partition shift value is
|
||||
``32 - P``.
|
||||
|
||||
Build the ring
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
The ring builder process includes these high-level steps:
|
||||
|
||||
#. The utility calculates the number of partitions to assign to each
|
||||
device based on the weight of the device. For example, for a
|
||||
partition at the power of 20, the ring has 1,048,576 partitions. One
|
||||
thousand devices of equal weight each want 1,048.576 partitions. The
|
||||
devices are sorted by the number of partitions they desire and kept
|
||||
in order throughout the initialization process.
|
||||
|
||||
.. note::
|
||||
|
||||
Each device is also assigned a random tiebreaker value that is
|
||||
used when two devices desire the same number of partitions. This
|
||||
tiebreaker is not stored on disk anywhere, and so two different
|
||||
rings created with the same parameters will have different
|
||||
partition assignments. For repeatable partition assignments,
|
||||
``RingBuilder.rebalance()`` takes an optional seed value that
|
||||
seeds the Python pseudo-random number generator.
|
||||
|
||||
#. The ring builder assigns each partition replica to the device that
|
||||
requires most partitions at that point while keeping it as far away
|
||||
as possible from other replicas. The ring builder prefers to assign a
|
||||
replica to a device in a region that does not already have a replica.
|
||||
If no such region is available, the ring builder searches for a
|
||||
device in a different zone, or on a different server. If it does not
|
||||
find one, it looks for a device with no replicas. Finally, if all
|
||||
options are exhausted, the ring builder assigns the replica to the
|
||||
device that has the fewest replicas already assigned.
|
||||
|
||||
.. note::
|
||||
|
||||
The ring builder assigns multiple replicas to one device only if
|
||||
the ring has fewer devices than it has replicas.
|
||||
|
||||
#. When building a new ring from an old ring, the ring builder
|
||||
recalculates the desired number of partitions that each device wants.
|
||||
|
||||
#. The ring builder unassigns partitions and gathers these partitions
|
||||
for reassignment, as follows:
|
||||
|
||||
- The ring builder unassigns any assigned partitions from any
|
||||
removed devices and adds these partitions to the gathered list.
|
||||
- The ring builder unassigns any partition replicas that can be
|
||||
spread out for better durability and adds these partitions to the
|
||||
gathered list.
|
||||
- The ring builder unassigns random partitions from any devices that
|
||||
have more partitions than they need and adds these partitions to
|
||||
the gathered list.
|
||||
|
||||
#. The ring builder reassigns the gathered partitions to devices by
|
||||
using a similar method to the one described previously.
|
||||
|
||||
#. When the ring builder reassigns a replica to a partition, the ring
|
||||
builder records the time of the reassignment. The ring builder uses
|
||||
this value when it gathers partitions for reassignment so that no
|
||||
partition is moved twice in a configurable amount of time. The
|
||||
RingBuilder class knows this configurable amount of time as
|
||||
``min_part_hours``. The ring builder ignores this restriction for
|
||||
replicas of partitions on removed devices because removal of a device
|
||||
happens on device failure only, and reassignment is the only choice.
|
||||
|
||||
These steps do not always perfectly rebalance a ring due to the random
|
||||
nature of gathering partitions for reassignment. To help reach a more
|
||||
balanced ring, the rebalance process is repeated until near perfect
|
||||
(less than 1 percent off) or when the balance does not improve by at
|
||||
least 1 percent (indicating we probably cannot get perfect balance due
|
||||
to wildly imbalanced zones or too many partitions recently moved).
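
To illustrate the weight-based calculation from the first step above,
the following sketch uses a hypothetical device list and counts each
replica of each partition separately; the real builder also balances
the assignments across regions, zones, and servers:

.. code-block:: python

   # Hypothetical devices; a real builder entry also carries region,
   # zone, ip, port, and device fields.
   devices = [
       {'id': 0, 'weight': 100.0},
       {'id': 1, 'weight': 100.0},
       {'id': 2, 'weight': 50.0},    # smaller disk, fewer partitions
   ]

   part_power = 20
   replicas = 3
   partitions = 2 ** part_power
   total_weight = sum(dev['weight'] for dev in devices)

   for dev in devices:
       # Each device wants a share of partition replicas proportional
       # to its weight.
       dev['parts_wanted'] = partitions * replicas * dev['weight'] / total_weight
       print(dev['id'], int(dev['parts_wanted']))
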
|
@ -0,0 +1,32 @@
|
||||
==============================================================
|
||||
Configure project-specific image locations with Object Storage
|
||||
==============================================================
|
||||
|
||||
For some deployers, it is not ideal to store all images in one place that
|
||||
all projects and users can access. You can configure the Image
|
||||
service to store image data in project-specific image locations. Then,
|
||||
only the following projects can use the Image service to access the
|
||||
created image:
|
||||
|
||||
- The project that owns the image
|
||||
- Projects that are defined in ``swift_store_admin_tenants`` and that
|
||||
have admin-level accounts
|
||||
|
||||
**To configure project-specific image locations**
|
||||
|
||||
#. Configure swift as your ``default_store`` in the
|
||||
``glance-api.conf`` file.
|
||||
|
||||
#. Set these configuration options in the ``glance-api.conf`` file:
|
||||
|
||||
- ``swift_store_multi_tenant``
|
||||
Set to ``True`` to enable tenant-specific storage locations.
|
||||
Default is ``False``.
|
||||
|
||||
- ``swift_store_admin_tenants``
|
||||
Specify a list of tenant IDs that are granted read and write access to all
|
||||
Object Storage containers that are created by the Image service.
|
||||
|
||||
With this configuration, images are stored in an Object Storage service
|
||||
(swift) endpoint that is pulled from the service catalog for the
|
||||
authenticated user.
|
208
doc/source/admin/objectstorage-troubleshoot.rst
Normal file
@ -0,0 +1,208 @@
|
||||
===========================
|
||||
Troubleshoot Object Storage
|
||||
===========================
|
||||
|
||||
For Object Storage, everything is logged in ``/var/log/syslog`` (or
|
||||
``messages`` on some distros). Several settings enable further
|
||||
customization of logging, such as ``log_name``, ``log_facility``, and
|
||||
``log_level``, within the object server configuration files.
|
||||
|
||||
Drive failure
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
Problem
|
||||
-------
|
||||
|
||||
Drive failure can prevent Object Storage from performing replication.
|
||||
|
||||
Solution
|
||||
--------
|
||||
|
||||
In the event that a drive has failed, the first step is to make sure the
|
||||
drive is unmounted. This will make it easier for Object Storage to work
|
||||
around the failure until it has been resolved. If the drive is going to
|
||||
be replaced immediately, then it is best to replace the drive,
|
||||
format it, remount it, and let replication fill it up.
|
||||
|
||||
If you cannot replace the drive immediately, then it is best to leave it
|
||||
unmounted, and remove the drive from the ring. This will allow all the
|
||||
replicas that were on that drive to be replicated elsewhere until the
|
||||
drive is replaced. Once the drive is replaced, it can be re-added to the
|
||||
ring.
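
Operators usually remove and re-add devices with the
:command:`swift-ring-builder` command. As a rough sketch of the
equivalent steps, assuming the ``RingBuilder`` API shown later in this
guide and a hypothetical builder path and device ID:

.. code-block:: python

   from swift.common.ring import RingBuilder

   # Hypothetical builder path and device ID; adjust for your cluster.
   builder = RingBuilder.load('/etc/swift/object.builder')
   builder.remove_dev(3)     # partitions on device 3 are reassigned
   builder.rebalance()       # during the next rebalance
   builder.save('/etc/swift/object.builder')

After saving the builder, write out and distribute the updated
``ring.gz`` file, for example with
:command:`swift-ring-builder object.builder write_ring`.
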
|
||||
|
||||
You can look at error messages in the ``/var/log/kern.log`` file for
|
||||
hints of drive failure.
|
||||
|
||||
Server failure
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
Problem
|
||||
-------
|
||||
|
||||
The server is potentially offline: it may have failed or may require a
|
||||
reboot.
|
||||
|
||||
Solution
|
||||
--------
|
||||
|
||||
If a server is having hardware issues, it is a good idea to make sure
|
||||
the Object Storage services are not running. This will allow Object
|
||||
Storage to work around the failure while you troubleshoot.
|
||||
|
||||
If the server just needs a reboot, or a small amount of work that should
|
||||
only last a couple of hours, then it is probably best to let Object
|
||||
Storage work around the failure and get the machine fixed and back
|
||||
online. When the machine comes back online, replication will make sure
|
||||
that anything that was missed during the downtime is updated.
|
||||
|
||||
If the server has more serious issues, then it is probably best to
|
||||
remove all of the server's devices from the ring. Once the server has
|
||||
been repaired and is back online, the server's devices can be added back
|
||||
into the ring. It is important that the devices are reformatted before
|
||||
putting them back into the ring, because each device is likely to be responsible for a
|
||||
different set of partitions than before.
|
||||
|
||||
Detect failed drives
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Problem
|
||||
-------
|
||||
|
||||
When a drive fails, it can be difficult to detect the failure
|
||||
and to determine its details.
|
||||
|
||||
Solution
|
||||
--------
|
||||
|
||||
It has been our experience that when a drive is about to fail, error
|
||||
messages appear in the ``/var/log/kern.log`` file. There is a script called
|
||||
``swift-drive-audit`` that can be run via cron to watch for bad drives. If
|
||||
errors are detected, it will unmount the bad drive, so that Object
|
||||
Storage can work around it. The script takes a configuration file with
|
||||
the following settings:
|
||||
|
||||
.. list-table:: **Description of configuration options for [drive-audit] in drive-audit.conf**
|
||||
:header-rows: 1
|
||||
|
||||
* - Configuration option = Default value
|
||||
- Description
|
||||
* - ``device_dir = /srv/node``
|
||||
- Directory devices are mounted under
|
||||
* - ``error_limit = 1``
|
||||
- Number of errors to find before a device is unmounted
|
||||
* - ``log_address = /dev/log``
|
||||
- Location where syslog sends the logs to
|
||||
* - ``log_facility = LOG_LOCAL0``
|
||||
- Syslog log facility
|
||||
* - ``log_file_pattern = /var/log/kern.*[!.][!g][!z]``
|
||||
- Location of the log file (accepts a globbing pattern) that is checked to locate device blocks with errors
|
||||
* - ``log_level = INFO``
|
||||
- Logging level
|
||||
* - ``log_max_line_length = 0``
|
||||
- Caps the length of log lines to the value given; no limit if set to 0,
|
||||
the default.
|
||||
* - ``log_to_console = False``
|
||||
- No help text available for this option.
|
||||
* - ``minutes = 60``
|
||||
- Number of minutes to look back in ``/var/log/kern.log``
|
||||
* - ``recon_cache_path = /var/cache/swift``
|
||||
- Directory where stats for a few items will be stored
|
||||
* - ``regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b``
|
||||
- No help text available for this option.
|
||||
* - ``unmount_failed_device = True``
|
||||
- No help text available for this option.
|
||||
|
||||
.. warning::
|
||||
|
||||
This script has only been tested on Ubuntu 10.04; use with caution on
|
||||
other operating systems in production.
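
The following is a much-simplified sketch of the kind of scan that
``swift-drive-audit`` performs. It is not the script itself; it only
reuses the default log pattern, error limit, and regular expression
from the table above:

.. code-block:: python

   import glob
   import re

   # Defaults taken from the drive-audit table above.
   log_file_pattern = '/var/log/kern.*[!.][!g][!z]'
   error_limit = 1
   regex = re.compile(r'\berror\b.*\b(dm-[0-9]{1,2}\d?)\b')

   errors_per_device = {}
   for log_path in glob.glob(log_file_pattern):
       with open(log_path, errors='replace') as log_file:
           for line in log_file:
               match = regex.search(line)
               if match:
                   device = match.group(1)
                   errors_per_device[device] = errors_per_device.get(device, 0) + 1

   for device, count in errors_per_device.items():
       if count >= error_limit:
           # The real script unmounts the device at this point when
           # unmount_failed_device is set to True.
           print('%s: %d error(s) found in the kernel logs' % (device, count))
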
|
||||
|
||||
Emergency recovery of ring builder files
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Problem
|
||||
-------
|
||||
|
||||
An emergency might prevent a successful backup from restoring the
|
||||
cluster to operational status.
|
||||
|
||||
Solution
|
||||
--------
|
||||
|
||||
You should always keep a backup of swift ring builder files. However, if
|
||||
an emergency occurs, this procedure may assist in returning your cluster
|
||||
to an operational state.
|
||||
|
||||
Using existing swift tools, there is no way to recover a builder file
|
||||
from a ``ring.gz`` file. However, if you have some knowledge of Python, it
|
||||
is possible to construct a builder file that is pretty close to the one
|
||||
you have lost.
|
||||
|
||||
.. warning::
|
||||
|
||||
This procedure is a last resort for emergency circumstances. It
|
||||
requires knowledge of the swift python code and may not succeed.
|
||||
|
||||
#. Load the ring and a new ringbuilder object in a Python REPL:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
>>> from swift.common.ring import RingData, RingBuilder
|
||||
>>> ring = RingData.load('/path/to/account.ring.gz')
|
||||
|
||||
#. Start copying the data we have in the ring into the builder:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
>>> import math
|
||||
>>> partitions = len(ring._replica2part2dev_id[0])
|
||||
>>> replicas = len(ring._replica2part2dev_id)
|
||||
|
||||
>>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
|
||||
>>> builder.devs = ring.devs
|
||||
>>> builder._replica2part2dev = ring._replica2part2dev_id
|
||||
>>> builder._last_part_moves_epoch = 0
|
||||
>>> from array import array
|
||||
>>> builder._last_part_moves = array('B', (0 for _ in xrange(partitions)))
|
||||
>>> builder._set_parts_wanted()
|
||||
>>> for d in builder._iter_devs():
|
||||
        d['parts'] = 0
|
||||
>>> for p2d in builder._replica2part2dev:
|
||||
        for dev_id in p2d:
|
||||
            builder.devs[dev_id]['parts'] += 1
|
||||
|
||||
This is the extent of the recoverable fields.
|
||||
|
||||
#. For ``min_part_hours`` you either have to remember what the value you
|
||||
used was, or just make up a new one:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
>>> builder.change_min_part_hours(24) # or whatever you want it to be
|
||||
|
||||
#. Validate the builder. If this raises an exception, check your
|
||||
previous code:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
>>> builder.validate()
|
||||
|
||||
#. After it validates, save the builder and create a new ``account.builder``:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
>>> import pickle
|
||||
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)
|
||||
>>> exit()
|
||||
|
||||
#. You should now have a file called ``account.builder`` in the current
|
||||
working directory. Run
|
||||
:command:`swift-ring-builder account.builder write_ring` and compare the new
|
||||
``account.ring.gz`` to the ``account.ring.gz`` that you started
|
||||
from. They probably are not byte-for-byte identical, but if you load them
|
||||
in a REPL and their ``_replica2part2dev_id`` and ``devs`` attributes are
|
||||
the same (or nearly so), then you are in good shape. A short comparison sketch follows these steps.
|
||||
|
||||
#. Repeat the procedure for ``container.ring.gz`` and
|
||||
``object.ring.gz``, and you might get usable builder files.
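
To make the comparison described in the verification step above, load
both rings in a REPL; the backup path below is only an example:

.. code-block:: python

   >>> from swift.common.ring import RingData
   >>> old_ring = RingData.load('/path/to/backup/account.ring.gz')
   >>> new_ring = RingData.load('account.ring.gz')
   >>> old_ring._replica2part2dev_id == new_ring._replica2part2dev_id
   >>> old_ring.devs == new_ring.devs

Both comparisons should evaluate to ``True``, or differ only in minor
device metadata, if the recovery succeeded.
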
|
@ -93,6 +93,7 @@ Administrator Documentation
|
||||
replication_network
|
||||
logs
|
||||
ops_runbook/index
|
||||
admin/index
|
||||
|
||||
Object Storage v1 REST API Documentation
|
||||
========================================
|
||||
|