docs migration from openstack-manuals

Context for this is at
https://specs.openstack.org/openstack/docs-specs/specs/pike/os-manuals-migration.html

Change-Id: I9a4da27ce1d56b6406e2db979698038488f3cf6f
John Dickinson 2017-07-05 17:04:26 -07:00
parent 37a1935198
commit 4cb76a41ce
27 changed files with 1450 additions and 0 deletions

@ -0,0 +1,22 @@
===================================
OpenStack Swift Administrator Guide
===================================

.. toctree::
   :maxdepth: 2

   objectstorage-intro.rst
   objectstorage-features.rst
   objectstorage-characteristics.rst
   objectstorage-components.rst
   objectstorage-ringbuilder.rst
   objectstorage-arch.rst
   objectstorage-replication.rst
   objectstorage-large-objects.rst
   objectstorage-auditors.rst
   objectstorage-EC.rst
   objectstorage-account-reaper.rst
   objectstorage-tenant-specific-image-storage.rst
   objectstorage-monitoring.rst
   objectstorage-admin.rst
   objectstorage-troubleshoot.rst

@ -0,0 +1,31 @@
==============
Erasure coding
==============
Erasure coding is a set of algorithms that allows the reconstruction of
missing data from a set of original data. In theory, erasure coding uses
less capacity than replication while providing similar durability.
From an application perspective, erasure coding support is transparent.
Object Storage (swift) implements erasure coding as a Storage Policy.
See `Storage Policies
<https://docs.openstack.org/developer/swift/overview_policies.html>`_
for more details.
There is no external API related to erasure coding. Create a container using a
Storage Policy; the interaction with the cluster is the same as with any
other durability policy. Because erasure coding is implemented as a Storage
Policy, you can isolate all storage devices associated with your cluster's
erasure coding capability. It is entirely possible to share devices between
storage policies, but for erasure coding it may make more sense to dedicate
not only separate devices but possibly even entire nodes to erasure
coding.
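
To make the capacity claim above concrete, the following sketch compares the
raw capacity consumed per byte of user data under full replication and under
an erasure coding scheme. The 10 data plus 4 parity fragment layout and the
function names are illustrative assumptions, not values taken from this guide.

.. code-block:: python

   def replication_overhead(replicas):
       """Raw bytes stored per byte of user data with full replicas."""
       return float(replicas)


   def erasure_coding_overhead(data_fragments, parity_fragments):
       """Raw bytes stored per byte of user data with data + parity fragments."""
       return (data_fragments + parity_fragments) / data_fragments


   if __name__ == "__main__":
       # Three full replicas versus a hypothetical 10 + 4 fragment layout.
       print(replication_overhead(3))
       print(erasure_coding_overhead(10, 4))

Under these assumptions, three replicas store three raw bytes for every byte
of user data, while the 10 + 4 layout stores 1.4.
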

.. important::

   The erasure code support in Object Storage is considered beta in Kilo.
   Most major functionality is included, but it has not been tested or
   validated at large scale. This feature relies on ``ssync`` for durability.
   We recommend deployers do extensive testing and not deploy production
   data using an erasure code storage policy.

   If any bugs are found during testing, please report them to
   https://bugs.launchpad.net/swift

@ -0,0 +1,51 @@
==============
Account reaper
==============
The purpose of the account reaper is to remove data from deleted accounts.
A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.
Typically, neither a specific retention time nor an undelete facility is
provided. However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file to
delay the actual deletion of data. At this time, to undelete an account you
have to update the account database replicas directly: set the ``status``
column to an empty string and update ``put_timestamp`` to be greater than
``delete_timestamp``.

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and scans the server
occasionally for account databases marked for deletion. It only fires up
on the accounts for which the server is the primary node, so that
multiple account servers aren't trying to do it simultaneously. Using
multiple servers to delete one account might improve the deletion speed
but requires coordination to avoid duplication. Speed really is not a
big concern with data deletion, and large accounts aren't deleted often.
Deleting an account is simple. For each account container, all objects
are deleted and then the container is deleted. Deletion requests that
fail will not stop the overall process but will cause the overall
process to fail eventually (for example, if an object delete times out,
you will not be able to delete the container or the account). The
account reaper keeps trying to delete an account until it is empty, at
which point the database reclaim process within the db\_replicator will
remove the database files.
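
The deletion pass described above can be sketched as follows. This is a
simplified in-memory model, not the reaper's actual code: the account is a
plain dictionary, and ``reap_account`` and ``delete_object`` are hypothetical
names introduced for illustration.

.. code-block:: python

   def reap_account(containers, delete_object):
       """Run one reaping pass over an account.

       containers: dict mapping container name -> set of object names.
       delete_object: callable(container, obj) -> bool, True on success.
       Returns True when the account is empty after this pass.
       """
       for container in list(containers):
           for obj in list(containers[container]):
               if delete_object(container, obj):
                   containers[container].discard(obj)
               # A failed delete does not stop the pass; the object stays
               # behind and is retried on a later pass.
           if not containers[container]:
               # A container can be removed only once it holds no objects.
               del containers[container]
       return not containers


   if __name__ == "__main__":
       account = {"photos": {"a.jpg", "b.jpg"}, "logs": {"x.log"}}
       failing = {("photos", "b.jpg")}

       def flaky_delete(container, obj):
           # Fail the first attempt for selected objects, then succeed.
           if (container, obj) in failing:
               failing.discard((container, obj))
               return False
           return True

       passes = 0
       while True:
           passes += 1
           if reap_account(account, flaky_delete):
               break
       print("account emptied after", passes, "passes")

In this model a single failed object delete leaves its container, and so the
account, in place until a later pass succeeds.
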
A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for example:

.. code-block:: console

   Account <name> has not been reaped since <date>

You can control when this is logged with the ``reap_warn_after`` value in the
``[account-reaper]`` section of the ``account-server.conf`` file.
The default value is 30 days.

@ -0,0 +1,11 @@
========================================
System administration for Object Storage
========================================
By understanding Object Storage concepts, you can better monitor and
administer your storage solution. The majority of the administration
information is maintained in developer documentation at
`docs.openstack.org/developer/swift/ <https://docs.openstack.org/developer/swift/>`__.
See the `OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/object-storage.html>`__
for a list of configuration options for Object Storage.

@ -0,0 +1,88 @@
====================
Cluster architecture
====================
Access tier
~~~~~~~~~~~
Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

.. note::

   If you want to use OpenStack Identity API v3 for authentication, you
   have the following options available in ``/etc/swift/dispersion.conf``:
   ``auth_version``, ``user_domain_name``, ``project_domain_name``,
   and ``project_name``.

**Object Storage architecture**

.. figure:: figures/objectstorage-arch.png

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.
Since this is an HTTP addressable storage service, you may incorporate a
load balancer into the access tier.
Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces - one for the
incoming ``front-end`` requests and the other for the ``back-end`` access to
the object storage nodes to put and fetch data.
Factors to consider
-------------------
For most publicly facing deployments as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.
Storage nodes
~~~~~~~~~~~~~
In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision object stores with a single
gigabit or 10 gigabit network interface, depending on the expected
workload and desired performance.
**Object Storage (swift)**
.. figure:: figures/objectstorage-nodes.png
Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.
Factors to consider
-------------------
You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.
To achieve higher apparent throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.

@ -0,0 +1,30 @@
==============
Object Auditor
==============
After a system failure, the XFS file system can sometimes truncate files it is
trying to write and produce zero-byte files. The object-auditor will catch
these problems, but in the case of a system crash it is advisable to run
an extra, less rate-limited sweep to check for these specific files.
You can run this command as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

.. note::

   ``-z`` means to only check for zero-byte files at 1000 files per second.

It is useful to run the object auditor on a specific device or set of devices.
You can run the object-auditor once as follows:

.. code-block:: console

   $ swift-object-auditor /path/to/object-server/config/file.conf once \
     --devices=sda,sdb

.. note::

   This will run the object auditor on only the ``sda`` and ``sdb`` devices.
   This parameter accepts a comma-separated list of values.

@ -0,0 +1,43 @@
==============================
Object Storage characteristics
==============================
The key characteristics of Object Storage are that:

- All objects stored in Object Storage have a URL.

- All objects stored are replicated 3✕ in as-unique-as-possible zones,
  which can be defined as a group of drives, a node, a rack, and so on.

- All objects have their own metadata.

- Developers interact with the object storage system through a RESTful
  HTTP API.

- Object data can be located anywhere in the cluster.

- The cluster scales by adding additional nodes without sacrificing
  performance, which allows a more cost-effective linear storage
  expansion than fork-lift upgrades.

- Data does not have to be migrated to an entirely new storage system.

- New nodes can be added to the cluster without downtime.

- Failed nodes and disks can be swapped out without downtime.

- It runs on industry-standard hardware, such as Dell, HP, and
  Supermicro.

.. _objectstorage-figure:

Object Storage (swift)

.. figure:: figures/objectstorage.png

Developers can either write directly to the Swift API or use one of the
many client libraries that exist for all of the popular programming
languages, such as Java, Python, Ruby, and C#. Amazon S3 and Rackspace
Cloud Files users should be very familiar with Object Storage. Users new
to object storage systems will have to adjust to a different approach
and mindset than those required for a traditional filesystem.

@ -0,0 +1,258 @@
==========
Components
==========
Object Storage uses the following components to deliver high
availability, high durability, and high concurrency:

- **Proxy servers** - Handle all of the incoming API requests.

- **Rings** - Map logical names of data to locations on particular
  disks.

- **Zones** - Isolate data from other zones. A failure in one zone
  does not impact the rest of the cluster as data replicates
  across zones.

- **Accounts and containers** - Each account and container are
  individual databases that are distributed across the cluster. An
  account database contains the list of containers in that account. A
  container database contains the list of objects in that container.

- **Objects** - The data itself.

- **Partitions** - A partition stores objects, account databases, and
  container databases and helps manage locations where data lives in
  the cluster.

.. _objectstorage-building-blocks-figure:

**Object Storage building blocks**

.. figure:: figures/objectstorage-buildingblocks.png

Proxy servers
-------------
Proxy servers are the public face of Object Storage and handle all of
the incoming API requests. Once a proxy server receives a request, it
determines the storage node based on the object's URL, for example:
https://swift.example.com/v1/account/container/object. Proxy servers
also coordinate responses, handle failures, and coordinate timestamps.
Proxy servers use a shared-nothing architecture and can be scaled as
needed based on projected workloads. A minimum of two proxy servers
should be deployed for redundancy. If one proxy server fails, the others
take over.
For more information concerning proxy server configuration, see
`Configuration Reference
<https://docs.openstack.org/ocata/config-reference/object-storage/proxy-server.html>`_.
Rings
-----
A ring represents a mapping between the names of entities stored on disks
and their physical locations. There are separate rings for accounts,
containers, and objects. When other components need to perform any
operation on an object, container, or account, they need to interact
with the appropriate ring to determine their location in the cluster.
The ring maintains this mapping using zones, devices, partitions, and
replicas. Each partition in the ring is replicated, by default, three
times across the cluster, and partition locations are stored in the
mapping maintained by the ring. The ring is also responsible for
determining which devices are used for handoff in failure scenarios.
Data can be isolated into zones in the ring. Each partition replica is
guaranteed to reside in a different zone. A zone could represent a
drive, a server, a cabinet, a switch, or even a data center.
The partitions of the ring are equally divided among all of the devices
in the Object Storage installation. When partitions need to be moved
around (for example, if a device is added to the cluster), the ring
ensures that a minimum number of partitions are moved at a time, and
only one replica of a partition is moved at a time.
You can use weights to balance the distribution of partitions on drives
across the cluster. This can be useful, for example, when differently
sized drives are used in a cluster.
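
As a rough illustration of how weights shift the distribution, the sketch
below shares partition replicas among devices in proportion to their weight.
The device names, weights, and the ``partitions_wanted`` helper are
hypothetical, and the real ring builder also applies zone and balance
constraints that are ignored here.

.. code-block:: python

   def partitions_wanted(weights, partition_count, replicas):
       """Share partition replicas among devices in proportion to weight."""
       total_weight = sum(weights.values())
       total_assignments = partition_count * replicas
       return {
           device: total_assignments * weight / total_weight
           for device, weight in weights.items()
       }


   if __name__ == "__main__":
       # A drive with twice the weight is asked to hold twice the partitions.
       print(partitions_wanted({"sda": 100, "sdb": 100, "sdc": 200}, 1024, 3))
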
The ring is used by the proxy server and several background processes
(like replication).
.. _objectstorage-ring-figure:
**The ring**
.. figure:: figures/objectstorage-ring.png
These rings are externally managed. The server processes themselves
do not modify the rings; they are instead given new rings modified by
other tools.
The ring uses a configurable number of bits from the ``MD5`` hash of a path
as a partition index that designates a device. The number of bits kept
from the hash is known as the partition power, and 2 to the partition
power indicates the partition count. Partitioning the full ``MD5`` hash ring
allows other parts of the cluster to work in batches of items at once,
which ends up either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.
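
A minimal sketch of the partition lookup described above, assuming the
partition index is taken from the most significant bits of the first four
bytes of the digest. The ``partition_for`` helper is hypothetical, and a real
deployment also salts the path with per-cluster values, which this sketch
omits.

.. code-block:: python

   import hashlib
   import struct


   def partition_for(path, partition_power):
       """Map a path to a partition using the top bits of its MD5 digest."""
       if not 1 <= partition_power <= 32:
           raise ValueError("partition_power must be between 1 and 32")
       digest = hashlib.md5(path.encode("utf-8")).digest()
       # Read the first four bytes as a big-endian unsigned integer, then
       # keep only the top `partition_power` bits.
       top32 = struct.unpack_from(">I", digest)[0]
       return top32 >> (32 - partition_power)


   if __name__ == "__main__":
       power = 10  # 2 ** 10 = 1024 partitions
       print(partition_for("/account/container/object", power))

The same path always maps to the same partition, and the result always falls
in the range from 0 to ``2 ** partition_power - 1``.
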
Another configurable value is the replica count, which indicates how
many of the partition-device assignments make up a single ring. For a
given partition number, each replica's device will not be in the same
zone as any other replica's device. Zones can be used to group devices
based on physical locations, power separations, network separations, or
any other attribute that would improve the availability of multiple
replicas at the same time.
Zones
-----
Object Storage allows configuring zones in order to isolate failure
boundaries. If possible, each data replica resides in a separate zone.
At the smallest level, a zone could be a single drive or a grouping of a
few drives. If there were five object storage servers, then each server
would represent its own zone. Larger deployments would have an entire
rack (or multiple racks) of object servers, each representing a zone.
The goal of zones is to allow the cluster to tolerate significant
outages of storage servers without losing all replicas of the data.
.. _objectstorage-zones-figure:
**Zones**
.. figure:: figures/objectstorage-zones.png
Accounts and containers
-----------------------
Each account and container is an individual SQLite database that is
distributed across the cluster. An account database contains the list of
containers in that account. A container database contains the list of
objects in that container.
.. _objectstorage-accountscontainers-figure:
**Accounts and containers**
.. figure:: figures/objectstorage-accountscontainers.png
To keep track of object data locations, each account in the system has a
database that references all of its containers, and each container
database references each object.
Partitions
----------
A partition is a collection of stored data. This includes account databases,
container databases, and objects. Partitions are core to the replication
system.
Think of a partition as a bin moving throughout a fulfillment center
warehouse. Individual orders get thrown into the bin. The system treats
that bin as a cohesive entity as it moves throughout the system. A bin
is easier to deal with than many little things. It makes for fewer
moving parts throughout the system.
System replicators and object uploads/downloads operate on partitions.
As the system scales up, its behavior continues to be predictable
because the number of partitions is a fixed number.
Implementing a partition is conceptually simple: a partition is just a
directory sitting on a disk with a corresponding hash table of what it
contains.
.. _objectstorage-partitions-figure:
**Partitions**
.. figure:: figures/objectstorage-partitions.png
Replicators
-----------
In order to ensure that there are three copies of the data everywhere,
replicators continuously examine each partition. For each local
partition, the replicator compares it against the replicated copies in
the other zones to see if there are any differences.
The replicator knows if replication needs to take place by examining
hashes. A hash file is created for each partition, which contains hashes
of each directory in the partition. For a given partition, the hash files
for each of the partition's three copies are compared. If the hashes are
different, then it is time to replicate, and the directory that needs to
be replicated is copied over.
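
The comparison step can be sketched as a dictionary diff. Here each hash file
is modeled as a mapping from directory name to hash value; the
``directories_to_sync`` helper and the sample values are hypothetical.

.. code-block:: python

   def directories_to_sync(local_hashes, remote_hashes):
       """Return directories whose hash differs or is missing remotely."""
       return sorted(
           name
           for name, digest in local_hashes.items()
           if remote_hashes.get(name) != digest
       )


   if __name__ == "__main__":
       local = {"a1f": "111", "b2e": "222", "c3d": "333"}
       remote = {"a1f": "111", "b2e": "999"}
       # Only the directories that differ need to be copied over.
       print(directories_to_sync(local, remote))
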
This is where partitions come in handy. With fewer things in the system,
larger chunks of data are transferred around (rather than lots of little
TCP connections, which is inefficient) and there is a consistent number
of hashes to compare.
The cluster is eventually consistent, with the newest data taking
priority.
.. _objectstorage-replication-figure:
**Replication**
.. figure:: figures/objectstorage-replication.png
If a zone goes down, one of the nodes containing a replica notices and
proactively copies data to a handoff location.
Use cases
---------
The following sections show use cases for object uploads and downloads
and introduce the components.
Upload
~~~~~~
A client uses the REST API to make an HTTP request to PUT an object into
an existing container. The cluster receives the request. First, the
system must figure out where the data is going to go. To do this, the
account name, container name, and object name are all used to determine
the partition where this object should live.
Then a lookup in the ring figures out which storage nodes contain the
partitions in question.
The data is then sent to each storage node where it is placed in the
appropriate partition. At least two of the three writes must be
successful before the client is notified that the upload was successful.
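
The two-of-three rule above is a majority quorum. The sketch below generalizes
it to other replica counts, which is an assumption on our part; the function
names are hypothetical.

.. code-block:: python

   def quorum_size(replicas):
       """Smallest number of successful writes that forms a majority."""
       return replicas // 2 + 1


   def upload_succeeded(write_results):
       """write_results holds one boolean per replica write."""
       return sum(write_results) >= quorum_size(len(write_results))


   if __name__ == "__main__":
       print(quorum_size(3))
       print(upload_succeeded([True, True, False]))
       print(upload_succeeded([True, False, False]))
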
Next, the container database is updated asynchronously to reflect that
there is a new object in it.
.. _objectstorage-usecase-figure:
**Object Storage in use**
.. figure:: figures/objectstorage-usecase.png
Download
~~~~~~~~
A request comes in for an account/container/object. Using the same
consistent hashing, the partition name is generated. A lookup in the
ring reveals which storage nodes contain that partition. A request is
made to one of the storage nodes to fetch the object and, if that fails,
requests are made to the other nodes.

@ -0,0 +1,63 @@
=====================
Features and benefits
=====================

.. list-table::
   :header-rows: 1
   :widths: 10 40

   * - Features
     - Benefits
   * - Leverages commodity hardware
     - No lock-in, lower price/GB.
   * - HDD/node failure agnostic
     - Self-healing, reliable, data redundancy protects from failures.
   * - Unlimited storage
     - Large and flat namespace, highly scalable read/write access,
       able to serve content directly from storage system.
   * - Multi-dimensional scalability
     - Scale-out architecture: Scale vertically and
       horizontally-distributed storage. Backs up and archives large
       amounts of data with linear performance.
   * - Account/container/object structure
     - No nesting, not a traditional file system: Optimized for scale,
       it scales to multiple petabytes and billions of objects.
   * - Built-in replication 3✕ + data redundancy (compared with 2✕ on
       RAID)
     - A configurable number of accounts, containers and object copies
       for high availability.
   * - Easily add capacity (unlike RAID resize)
     - Elastic data scaling with ease.
   * - No central database
     - Higher performance, no bottlenecks.
   * - RAID not required
     - Handle many small, random reads and writes efficiently.
   * - Built-in management utilities
     - Account management: Create, add, verify, and delete users;
       Container management: Upload, download, and verify; Monitoring:
       Capacity, host, network, log trawling, and cluster health.
   * - Drive auditing
     - Detect drive failures preempting data corruption.
   * - Expiring objects
     - Users can set an expiration time or a TTL on an object to
       control access.
   * - Direct object access
     - Enable direct browser access to content, such as for a control
       panel.
   * - Realtime visibility into client requests
     - Know what users are requesting.
   * - Supports S3 API
     - Utilize tools that were designed for the popular S3 API.
   * - Restrict containers per account
     - Limit access to control usage by user.
   * - Support for NetApp, Nexenta, Solidfire
     - Unified support for block volumes using a variety of storage
       systems.
   * - Snapshot and backup API for block volumes.
     - Data protection and recovery for VM data.
   * - Standalone volume API available
     - Separate endpoint and API for integration with other compute
       systems.
   * - Integration with Compute
     - Fully integrated with Compute for attaching block volumes and
       reporting on usage.

@ -0,0 +1,23 @@
==============================
Introduction to Object Storage
==============================
OpenStack Object Storage (swift) is used for redundant, scalable data
storage using clusters of standardized servers to store petabytes of
accessible data. It is a long-term storage system for large amounts of
static data which can be retrieved and updated. Object Storage uses a
distributed architecture
with no central point of control, providing greater scalability,
redundancy, and permanence. Objects are written to multiple hardware
devices, with the OpenStack software responsible for ensuring data
replication and integrity across the cluster. Storage clusters scale
horizontally by adding new nodes. Should a node fail, OpenStack works to
replicate its content from other active nodes. Because OpenStack uses
software logic to ensure data replication and distribution across
different devices, inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.
Object Storage is ideal for cost-effective, scale-out storage. It
provides a fully distributed, API-accessible storage platform that can
be integrated directly into applications or used for backup, archiving,
and data retention.

@ -0,0 +1,35 @@
====================
Large object support
====================
Object Storage (swift) uses segmentation to support the upload of large
objects. By default, Object Storage limits the upload size of a single
object to 5 GB. Using segmentation, the size of a single uploaded object is
virtually unlimited. The segmentation process works by fragmenting the object
and automatically creating a file that sends the segments together as
a single object. This option offers greater upload speed with the possibility
of parallel uploads.
Large objects
~~~~~~~~~~~~~
A large object consists of two types of objects:

- **Segment objects** store the object content. You can divide your
  content into segments, and upload each segment into its own segment
  object. Segment objects do not have any special features. You create,
  update, download, and delete segment objects just as you would normal
  objects.

- A **manifest object** links the segment objects into one logical
  large object. When you download a manifest object, Object Storage
  concatenates and returns the contents of the segment objects in the
  response body of the request. The manifest object types are:

  - **Static large objects**
  - **Dynamic large objects**
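
The relationship between segment objects and a manifest can be modeled in a
few lines. This is an in-memory sketch only: the segment names, the dictionary
standing in for a container, and both helper functions are hypothetical, and
no Object Storage API is called.

.. code-block:: python

   def split_into_segments(data, segment_size):
       """Split bytes into fixed-size segments; the last may be shorter."""
       return [
           data[i:i + segment_size]
           for i in range(0, len(data), segment_size)
       ]


   def download_via_manifest(manifest, segment_store):
       """Concatenate the segments named by the manifest, in order."""
       return b"".join(segment_store[name] for name in manifest)


   if __name__ == "__main__":
       payload = b"0123456789"
       segments = split_into_segments(payload, 4)
       # Zero-padded names keep the segments in upload order when sorted.
       store = {"seg/%08d" % i: seg for i, seg in enumerate(segments)}
       manifest = sorted(store)
       print(download_via_manifest(manifest, store) == payload)
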
To find out more information on large object support, see `Large objects
<https://docs.openstack.org/user-guide/cli-swift-large-object-creation.html>`_
in the OpenStack End User Guide, or `Large Object Support
<https://docs.openstack.org/developer/swift/overview_large_objects.html>`_
in the developer documentation.

@ -0,0 +1,228 @@
=========================
Object Storage monitoring
=========================

.. note::

   This section was excerpted from a blog post by `Darrell
   Bishop <http://swiftstack.com/blog/2012/04/11/swift-monitoring-with-statsd>`_ and
   has since been edited.

An OpenStack Object Storage cluster is a collection of many daemons that
work together across many nodes. With so many different components, you
must be able to tell what is going on inside the cluster. Tracking
server-level meters like CPU utilization, load, memory consumption, disk
usage and utilization, and so on is necessary, but not sufficient.
Swift Recon
~~~~~~~~~~~
The Swift Recon middleware (see
`Cluster Telemetry and Monitoring <https://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring>`_)
provides general machine statistics, such as load average, socket
statistics, ``/proc/meminfo`` contents, as well as Swift-specific meters:

- The ``MD5`` sum of each ring file.

- The most recent object replication time.

- Count of each type of quarantined file: account, container, or
  object.

- Count of "async_pendings" (deferred container updates) on disk.

Swift Recon is middleware that is installed in the object server's
pipeline and takes one required option: a local cache directory. To
track ``async_pendings``, you must set up an additional cron job for
each object server. You access data by either sending HTTP requests
directly to the object server or using the ``swift-recon`` command-line
client.
Object Storage cluster statistics are available, but the typical
server meters overlap with existing server monitoring systems. To get
the Swift-specific meters into a monitoring system, they must be polled.
Swift Recon acts as a middleware meters collector. The
process that feeds meters to your statistics system, such as
``collectd`` or ``gmond``, should already run on the storage node.
You can choose to either talk to Swift Recon or collect the meters
directly.
Swift-Informant
~~~~~~~~~~~~~~~
Swift-Informant middleware (see
`swift-informant <https://github.com/pandemicsyn/swift-informant>`_) has
real-time visibility into Object Storage client requests. It sits in the
pipeline for the proxy server, and after each request to the proxy server it
sends three meters to a ``StatsD`` server:

- A counter increment for a meter like ``obj.GET.200`` or
  ``cont.PUT.404``.

- Timing data for a meter like ``acct.GET.200`` or ``obj.GET.200``.
  [The README says the meters look like ``duration.acct.GET.200``, but
  I do not see the ``duration`` in the code. I am not sure what the
  Etsy server does but our StatsD server turns timing meters into five
  derivative meters with new segments appended, so it probably works as
  coded. The first meter turns into ``acct.GET.200.lower``,
  ``acct.GET.200.upper``, ``acct.GET.200.mean``,
  ``acct.GET.200.upper_90``, and ``acct.GET.200.count``].

- A counter increase by the bytes transferred for a meter like
  ``tfer.obj.PUT.201``.
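
For illustration, the three meters listed above could be rendered in the
common StatsD line format (``name:value|type``) as follows. The
``informant_meters`` helper is hypothetical, and the exact payloads that
Swift-Informant sends are an assumption here rather than something this guide
specifies.

.. code-block:: python

   def informant_meters(server_type, method, status, duration_ms, nbytes):
       """Build the three StatsD lines for a single proxy request."""
       name = "%s.%s.%d" % (server_type, method, status)
       return [
           "%s:1|c" % name,                   # counter increment
           "%s:%d|ms" % (name, duration_ms),  # timing data in milliseconds
           "tfer.%s:%d|c" % (name, nbytes),   # counter raised by bytes moved
       ]


   if __name__ == "__main__":
       for line in informant_meters("obj", "PUT", 201, 35, 1024):
           print(line)
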
The timing meters show the quality of service that clients experience,
and the counters show the volume of requests for each combination of
server type, command, and response code. Swift-Informant requires no
change to core Object Storage code because it is implemented as
middleware. However, it gives no insight into the workings of the cluster
past the proxy server. If the responsiveness of one storage node degrades,
you can only see that some of the requests are bad, either as high latency
or error status codes.
Statsdlog
~~~~~~~~~
The `Statsdlog <https://github.com/pandemicsyn/statsdlog>`_
project increments StatsD counters based on logged events. Like
Swift-Informant, it is non-intrusive; however, statsdlog can track
events from all Object Storage daemons, not just proxy-server. The
daemon listens to a UDP stream of syslog messages, and StatsD counters
are incremented when a log line matches a regular expression. Meter
names are mapped to regex match patterns in a JSON file, allowing
flexible configuration of what meters are extracted from the log stream.
Currently, only the first matching regex triggers a StatsD counter
increment, and the counter is always incremented by one. There is no way
to increment a counter by more than one or send timing data to StatsD
based on the log line content. The tool could be extended to handle more
meters for each line and data extraction, including timing data. But a
coupling would still exist between the log textual format and the log
parsing regexes, which would themselves be more complex to support
multiple matches for each line and data extraction. Also, log processing
introduces a delay between the triggering event and sending the data to
StatsD. It would be preferable to increment error counters where they
occur and send timing data as soon as it is known to avoid coupling
between a log string and a parsing regex and prevent a time delay
between events and sending data to StatsD.
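
The first-match behavior described above can be sketched as follows. The
meter names and patterns are hypothetical, and the mapping is hard-coded here
instead of being loaded from a JSON file.

.. code-block:: python

   import re
   from collections import Counter

   # Hypothetical mapping of meter names to regex patterns, in match order.
   PATTERNS = [
       ("errors.timeout", re.compile(r"timed out", re.IGNORECASE)),
       ("errors.any", re.compile(r"error", re.IGNORECASE)),
   ]


   def count_log_lines(lines, patterns=PATTERNS):
       """Increment one counter per line, for the first matching pattern."""
       counters = Counter()
       for line in lines:
           for meter, pattern in patterns:
               if pattern.search(line):
                   counters[meter] += 1
                   break  # later patterns are ignored for this line
       return counters


   if __name__ == "__main__":
       sample = ["ERROR: request timed out", "error reading disk", "ok"]
       print(dict(count_log_lines(sample)))

The first sample line matches both patterns, but only ``errors.timeout`` is
incremented, which shows the single-counter limitation.
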
The next section describes another method for gathering Object Storage
operational meters.
Swift StatsD logging
~~~~~~~~~~~~~~~~~~~~
StatsD (see `Measure Anything, Measure Everything
<http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/>`_)
was designed for application code to be deeply instrumented. Meters are
sent in real-time by the code that just noticed or did something. The
overhead of sending a meter is extremely low: a ``sendto`` of one UDP
packet. If that overhead is still too high, the StatsD client library
can send only a random portion of samples and StatsD approximates the
actual number when flushing meters upstream.
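
A sketch of the sampling idea, assuming the common StatsD convention of
appending ``|@rate`` to a sampled counter line. The helper names are
hypothetical and no packet is actually sent.

.. code-block:: python

   import random


   def sampled_counter_line(name, sample_rate, rng=random):
       """Return a StatsD counter line, or None if this sample is skipped."""
       if sample_rate >= 1:
           return "%s:1|c" % name
       if rng.random() >= sample_rate:
           return None  # drop this sample to save the UDP send
       return "%s:1|c|@%s" % (name, sample_rate)


   def estimated_count(received, sample_rate):
       """Scale the number of received samples back up to an estimate."""
       return received / sample_rate


   if __name__ == "__main__":
       sent = [sampled_counter_line("obj.GET.200", 0.1) for _ in range(1000)]
       received = sum(1 for line in sent if line is not None)
       print(received, estimated_count(received, 0.1))
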
To avoid the problems inherent with middleware-based monitoring and
after-the-fact log processing, the sending of StatsD meters is
integrated into Object Storage itself. The submitted change set (see
`<https://review.openstack.org/#change,6058>`_) currently reports 124 meters
across 15 Object Storage daemons and the tempauth middleware. Details of
the meters tracked are in the `Administrator's
Guide <https://docs.openstack.org/developer/swift/admin_guide.html>`_.
The sending of meters is integrated with the logging framework. To
enable, configure ``log_statsd_host`` in the relevant config file. You
can also specify the port and a default sample rate. The specified
default sample rate is used unless a specific call to a statsd logging
method (see the list below) overrides it. Currently, no logging calls
override the sample rate, but it is conceivable that some meters may
require accuracy (``sample_rate=1``) while others may not.
.. code-block:: ini
[DEFAULT]
# ...
log_statsd_host = 127.0.0.1
log_statsd_port = 8125
log_statsd_default_sample_rate = 1
Then the LogAdapter object returned by ``get_logger()``, usually stored
in ``self.logger``, has these new methods:
- ``set_statsd_prefix(self, prefix)`` Sets the client library stat
prefix value which gets prefixed to every meter. The default prefix
is the ``name`` of the logger such as ``object-server``,
``container-auditor``, and so on. This is currently used to turn
``proxy-server`` into one of ``proxy-server.Account``,
``proxy-server.Container``, or ``proxy-server.Object`` as soon as the
Controller object is determined and instantiated for the request.
- ``update_stats(self, metric, amount, sample_rate=1)`` Increments
the supplied meter by the given amount. This is used when you need
to add or subtract more than one from a counter, like incrementing
``suffix.hashes`` by the number of computed hashes in the object
replicator.
- ``increment(self, metric, sample_rate=1)`` Increments the given counter
meter by one.
- ``decrement(self, metric, sample_rate=1)`` Lowers the given counter
meter by one.
- ``timing(self, metric, timing_ms, sample_rate=1)`` Record that the
given meter took the supplied number of milliseconds.
- ``timing_since(self, metric, orig_time, sample_rate=1)``
Convenience method to record a timing meter whose value is "now"
minus an existing timestamp.
.. note::
These logging methods may safely be called anywhere you have a
logger object. If StatsD logging has not been configured, the methods
are no-ops. This avoids messy conditional logic each place a meter is
recorded. These example usages show the new logging methods:
.. code-block:: python

   # swift/obj/replicator.py

   def update(self, job):
       # ...
       begin = time.time()
       try:
           hashed, local_hash = tpool.execute(
               tpooled_get_hashes, job['path'],
               do_listdir=(self.replication_count % 10) == 0,
               reclaim_age=self.reclaim_age)
           # See tpooled_get_hashes "Hack".
           if isinstance(hashed, BaseException):
               raise hashed
           self.suffix_hash += hashed
           self.logger.update_stats('suffix.hashes', hashed)
           # ...
       finally:
           self.partition_times.append(time.time() - begin)
           self.logger.timing_since('partition.update.timing', begin)
.. code-block:: python

   # swift/container/updater.py

   def process_container(self, dbfile):
       # ...
       start_time = time.time()
       # ... (elided code includes the ``if`` that opens the block below)
           for event in events:
               if 200 <= event.wait() < 300:
                   successes += 1
               else:
                   failures += 1
           if successes > failures:
               self.logger.increment('successes')
               # ...
           else:
               self.logger.increment('failures')
               # ...
           # Only track timing data for attempted updates:
           self.logger.timing_since('timing', start_time)
       else:
           self.logger.increment('no_changes')
           self.no_changes += 1

@ -0,0 +1,98 @@
===========
Replication
===========
Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.
Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.
To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the ``consistency
window``. This window is defined by the duration of replication and by how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.
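The tombstone rule can be illustrated with a small age check. This is a sketch only: the function name is invented, the one-week window is an assumed value, and ``reclaim_age`` is used simply as a name for the length of the consistency window in seconds.

```python
import time


def reclaimable(tombstone_timestamp, reclaim_age, now=None):
    """Return True once a tombstone is older than the consistency window."""
    now = time.time() if now is None else now
    return now - tombstone_timestamp > reclaim_age


WEEK = 7 * 24 * 3600  # assumed window length, in seconds
now = 1_000_000_000.0  # fixed "current time" so the example is repeatable
fresh = reclaimable(now - 3600, WEEK, now)            # deleted an hour ago
stale = reclaimable(now - 8 * 24 * 3600, WEEK, now)   # deleted 8 days ago
```

A tombstone removed too early could let a replica that missed the deletion push the old data back, which is why cleanup waits for the full window.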
If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.
.. note::
The replicator does not maintain desired levels of replication when
other failures, such as entire node failures, occur, because most
failures are transient.
The main replication types are:
- Database replication
Replicates account and container databases.
- Object replication
Replicates object data.
Database replication
~~~~~~~~~~~~~~~~~~~~
Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.
This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
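The role of the synchronization point can be sketched with an in-memory model. Everything here is a simplification for illustration: ``ToyDatabase`` and ``push`` are hypothetical names, records are plain strings, and the sync table is a dictionary kept on the receiving side.

```python
class ToyDatabase:
    """In-memory stand-in for an account or container database replica."""

    def __init__(self, db_id):
        self.db_id = db_id
        self.records = []      # (record_id, payload); record_id only increases
        self.sync_points = {}  # peer db_id -> last record_id known to be synced

    def add(self, payload):
        self.records.append((len(self.records) + 1, payload))

    def payloads(self):
        return [payload for _, payload in self.records]


def push(local, remote):
    """Push records added to local since its last sync with remote."""
    last_synced = remote.sync_points.get(local.db_id, 0)
    new_records = [r for r in local.records if r[0] > last_synced]
    for _, payload in new_records:
        if payload not in remote.payloads():
            remote.add(payload)
    if local.records:
        remote.sync_points[local.db_id] = local.records[-1][0]
    # Share local's own sync table so remote inherits its sync history.
    for peer_id, record_id in local.sync_points.items():
        if peer_id != remote.db_id:
            current = remote.sync_points.get(peer_id, 0)
            remote.sync_points[peer_id] = max(current, record_id)
    return len(new_records)


a, b = ToyDatabase("a"), ToyDatabase("b")
a.add("obj1")
a.add("obj2")
first_push = push(a, b)   # both records are new to b
a.add("obj3")
second_push = push(a, b)  # only the record past the sync point is sent
```

The second push transfers a single record because the high water mark already covers the first two.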
If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.
In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.
Object replication
~~~~~~~~~~~~~~~~~~
The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.
The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
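The comparison step above amounts to a dictionary diff. The following sketch assumes the hashes file can be modeled as a mapping from suffix directory name to hash; the function name and the sample values are made up for the example.

```python
def suffixes_to_sync(local_hashes, remote_hashes):
    """Return suffix directories whose hash differs or is missing remotely."""
    return sorted(
        suffix
        for suffix, digest in local_hashes.items()
        if remote_hashes.get(suffix) != digest
    )


# Toy per-partition hash tables: suffix directory -> hash of its contents.
local = {"0a1": "d41d8c", "1f3": "9e107d", "fe2": "e4d909"}
remote = {"0a1": "d41d8c", "1f3": "000000"}
to_rsync = suffixes_to_sync(local, remote)
```

Only ``1f3`` (different hash) and ``fe2`` (absent on the remote) would be rsynced; the matching ``0a1`` directory is never traversed.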
The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.

@ -0,0 +1,228 @@
============
Ring-builder
============
Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
Because of the way the ring-builder manages changes to the ring, using a
slightly older version of the ring usually means only that one of the
three replicas for a subset of partitions is incorrect, which is easy to
work around.
The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.
Ring data structure
~~~~~~~~~~~~~~~~~~~
The ring data structure consists of three top-level fields: a list of
devices in the cluster, a list of lists of device IDs indicating
partition to device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.
Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~
This is a list of ``array('H')`` of device IDs. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.
So, to create a list of device dictionaries assigned to a partition, the
Python code would look like:
.. code-block:: python
devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]
That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.
``array('H')`` is used for memory conservation as there may be millions
of partitions.
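A self-contained variant of the lookup that also removes duplicate devices might look like the following. The device list and assignment table are toy values, and ``devices_for`` is a hypothetical helper, not a method of the Ring class.

```python
from array import array

devs = [{"id": 0, "device": "sda"}, {"id": 1, "device": "sdb"}]
# Three replicas, four partitions, only two devices (toy values).
replica2part2dev_id = [
    array("H", [0, 1, 0, 1]),
    array("H", [1, 0, 1, 0]),
    array("H", [0, 0, 1, 1]),
]


def devices_for(partition):
    """Return the distinct devices holding a partition, in replica order."""
    seen = set()
    result = []
    for part2dev_id in replica2part2dev_id:
        dev_id = part2dev_id[partition]
        if dev_id not in seen:
            seen.add(dev_id)
            result.append(devs[dev_id])
    return result


names = [d["device"] for d in devices_for(0)]
```

With three replicas on two devices, partition 0 maps to device IDs 0, 1, and 0, so the deduplicated result contains two devices rather than three.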
Overload
~~~~~~~~
The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it cannot do both, the overload
factor determines what happens. Each device takes an extra
fraction of its desired partitions to allow for replica dispersion;
after that extra fraction is exhausted, replicas are placed closer
together than optimal.
The overload factor lets the operator trade off replica
dispersion (durability) against data dispersion (uniform disk usage).
The default overload factor is 0, so device weights are strictly
followed.
With an overload factor of 0.1, each device accepts 10% more
partitions than it otherwise would, but only if it needs to maintain
partition dispersion.
For example, consider a 3-node cluster of machines with equal-size disks;
node A has 12 disks, node B has 12 disks, and node C has
11 disks. The ring has an overload factor of 0.1 (10%).
Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device can accept
up to 10% more partitions for the sake of dispersion. The
missing disk in C means there is one disk's worth of partitions
to spread across the remaining 11 disks, which gives each
disk in C an extra 9.09% load. Since this is less than the 10%
overload, there is one replica of each partition on each node.
However, this does mean that the disks in node C have more data
than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks reach 80% full while A
and B's disks are only about 73.3% full (80% × 11/12).
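The extra-load figure in this example can be checked with a short calculation. The helper name is invented; the numbers are the ones from the three-node scenario above.

```python
def extra_load(full_disks, actual_disks):
    """Fractional extra load per disk on a node that is missing disks."""
    return full_disks / actual_disks - 1


overload = 0.10
extra = extra_load(12, 11)  # one disk's worth spread over 11 disks
fits = extra <= overload    # node C can still hold a full replica
```

Here ``extra`` is 1/11, or about 9.09%, which is within the 10% overload, so dispersion is preserved.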
Replica counts
~~~~~~~~~~~~~~
To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.
A fractional replica count is for the whole ring and not for individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.
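As a worked example of that arithmetic, the sketch below splits a partition count according to a fractional replica count. The function name and the partition count of 1,000 are chosen for illustration only.

```python
def replica_breakdown(replica_count, partitions):
    """Map each replica level to the number of partitions that have it."""
    base = int(replica_count)
    extra_parts = round((replica_count - base) * partitions)
    return {base: partitions - extra_parts, base + 1: extra_parts}


breakdown = replica_breakdown(3.2, 1000)
```

For 3.2 replicas over 1,000 partitions, 800 partitions have three replicas and 200 have four.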
The replica count is adjustable. For example:
.. code-block:: console
$ swift-ring-builder account.builder set_replicas 4
$ swift-ring-builder account.builder rebalance
You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.
You can gradually increase the replica count at a rate that does not
adversely affect cluster performance. For example:
.. code-block:: console
$ swift-ring-builder object.builder set_replicas 3.01
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
$ swift-ring-builder object.builder set_replicas 3.02
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost, provided that you correct the value before you
rebalance.
Additionally, the :command:`swift-ring-builder X.builder create` command can
now take a decimal argument for the number of replicas.
Partition shift value
~~~~~~~~~~~~~~~~~~~~~
The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the ``/account/container/object`` path using Python:
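.. code-block:: python

   from hashlib import md5
   from struct import unpack_from

   # Excerpt from inside the Ring class; md5() requires bytes on Python 3.
   partition = unpack_from(
       '>I', md5(b'/account/container/object').digest())[0] >> self._part_shift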
For a ring generated with part\_power P, the partition shift value is
``32 - P``.
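A standalone version of the same calculation, with the shift derived from the partition power, could be written as follows. ``partition_for`` is a hypothetical helper, and the sketch hashes the bare path only; a real cluster also mixes a per-cluster hash path prefix or suffix into the hash, so the resulting numbers will not match a production ring.

```python
from hashlib import md5
from struct import unpack_from


def partition_for(path, part_power):
    """Map a path to a partition using the top 32 bits of its MD5 hash."""
    part_shift = 32 - part_power
    digest = md5(path.encode("utf-8")).digest()
    # '>I' reads the first four bytes as a big-endian unsigned integer.
    return unpack_from(">I", digest)[0] >> part_shift


part = partition_for("/account/container/object", 20)
```

Shifting a 32-bit value right by ``32 - P`` bits leaves exactly ``P`` bits, so the result always falls in the range of valid partition numbers.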
Build the ring
~~~~~~~~~~~~~~
The ring builder process includes these high-level steps:
#. The utility calculates the number of partitions to assign to each
device based on the weight of the device. For example, for a
partition power of 20, the ring has 1,048,576 partitions. One
thousand devices of equal weight each want 1,048.576 partitions. The
devices are sorted by the number of partitions they desire and kept
in order throughout the initialization process.
.. note::
Each device is also assigned a random tiebreaker value that is
used when two devices desire the same number of partitions. This
tiebreaker is not stored on disk anywhere, and so two different
rings created with the same parameters will have different
partition assignments. For repeatable partition assignments,
``RingBuilder.rebalance()`` takes an optional seed value that
seeds the Python pseudo-random number generator.
#. The ring builder assigns each partition replica to the device that
desires the most partitions at that point while keeping it as far away
as possible from other replicas. The ring builder prefers to assign a
replica to a device in a region that does not already have a replica.
If no such region is available, the ring builder searches for a
device in a different zone, or on a different server. If it does not
find one, it looks for a device with no replicas. Finally, if all
options are exhausted, the ring builder assigns the replica to the
device that has the fewest replicas already assigned.
.. note::
The ring builder assigns multiple replicas to one device only if
the ring has fewer devices than it has replicas.
#. When building a new ring from an old ring, the ring builder
recalculates the desired number of partitions that each device wants.
#. The ring builder unassigns partitions and gathers these partitions
for reassignment, as follows:
- The ring builder unassigns any assigned partitions from any
removed devices and adds these partitions to the gathered list.
- The ring builder unassigns any partition replicas that can be
spread out for better durability and adds these partitions to the
gathered list.
- The ring builder unassigns random partitions from any devices that
have more partitions than they need and adds these partitions to
the gathered list.
#. The ring builder reassigns the gathered partitions to devices by
using a similar method to the one described previously.
#. When the ring builder reassigns a replica to a partition, the ring
builder records the time of the reassignment. The ring builder uses
this value when it gathers partitions for reassignment so that no
partition is moved twice in a configurable amount of time. The
RingBuilder class knows this configurable amount of time as
``min_part_hours``. The ring builder ignores this restriction for
replicas of partitions on removed devices because removal of a device
happens on device failure only, and reassignment is the only choice.
These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until near perfect
(less than 1 percent off) or when the balance does not improve by at
least 1 percent (indicating we probably cannot get perfect balance due
to wildly imbalanced zones or too many partitions recently moved).
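The weight-based calculation in the first step can be sketched as a simple proportion. This is an illustration under simplifying assumptions, not the ring builder's code: ``parts_wanted`` is a hypothetical function, and replicas, regions, and zones are ignored.

```python
def parts_wanted(weights, part_power):
    """Return the number of partitions each device desires, by weight."""
    total_parts = 2 ** part_power
    weight_sum = sum(weights.values())
    return {dev: total_parts * w / weight_sum for dev, w in weights.items()}


# One thousand devices of equal weight, as in the example in step 1.
weights = {"d%d" % i: 100.0 for i in range(1000)}
wanted = parts_wanted(weights, 20)
# Devices ordered by how many partitions they desire, largest first.
order = sorted(wanted, key=wanted.get, reverse=True)
```

Each of the thousand devices ends up wanting 1,048.576 partitions, which matches the figure given in step 1.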

@ -0,0 +1,32 @@
==============================================================
Configure project-specific image locations with Object Storage
==============================================================
For some deployers, it is not ideal to store all images in one place to
enable all projects and users to access them. You can configure the Image
service to store image data in project-specific image locations. Then,
only the following projects can use the Image service to access the
created image:
- The project that owns the image
- Projects that are defined in ``swift_store_admin_tenants`` and that
have admin-level accounts
**To configure project-specific image locations**
#. Configure swift as your ``default_store`` in the
``glance-api.conf`` file.
#. Set these configuration options in the ``glance-api.conf`` file:
- swift_store_multi_tenant
Set to ``True`` to enable tenant-specific storage locations.
Default is ``False``.
- swift_store_admin_tenants
Specify a list of tenant IDs that are granted read and write access to all
Object Storage containers that are created by the Image service.
With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.

@ -0,0 +1,208 @@
===========================
Troubleshoot Object Storage
===========================
For Object Storage, everything is logged in ``/var/log/syslog`` (or
``messages`` on some distros). Several settings enable further
customization of logging, such as ``log_name``, ``log_facility``, and
``log_level``, within the object server configuration files.
Drive failure
~~~~~~~~~~~~~
Problem
-------
Drive failure can prevent Object Storage from performing replication.
Solution
--------
In the event that a drive has failed, the first step is to make sure the
drive is unmounted. This will make it easier for Object Storage to work
around the failure until it has been resolved. If the drive is going to
be replaced immediately, then it is best to simply replace the drive,
format it, remount it, and let replication fill it up.
If you cannot replace the drive immediately, then it is best to leave it
unmounted, and remove the drive from the ring. This will allow all the
replicas that were on that drive to be replicated elsewhere until the
drive is replaced. Once the drive is replaced, it can be re-added to the
ring.
You can look at error messages in the ``/var/log/kern.log`` file for
hints of drive failure.
Server failure
~~~~~~~~~~~~~~
Problem
-------
The server is potentially offline and might have failed or might require a
reboot.
Solution
--------
If a server is having hardware issues, it is a good idea to make sure
the Object Storage services are not running. This will allow Object
Storage to work around the failure while you troubleshoot.
If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Object
Storage work around the failure and get the machine fixed and back
online. When the machine comes back online, replication will make sure
that anything that was missed during the downtime gets updated.
If the server has more serious issues, then it is probably best to
remove all of the server's devices from the ring. Once the server has
been repaired and is back online, the server's devices can be added back
into the ring. It is important that the devices are reformatted before
putting them back into the ring, as each device is likely to be responsible
for a different set of partitions than before.
Detect failed drives
~~~~~~~~~~~~~~~~~~~~
Problem
-------
When drives fail, it can be difficult to detect that a drive has failed,
and the details of the failure.
Solution
--------
It has been our experience that when a drive is about to fail, error
messages appear in the ``/var/log/kern.log`` file. There is a script called
``swift-drive-audit`` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Object
Storage can work around it. The script takes a configuration file with
the following settings:
.. list-table:: **Description of configuration options for [drive-audit] in drive-audit.conf**
:header-rows: 1
* - Configuration option = Default value
- Description
* - ``device_dir = /srv/node``
- Directory devices are mounted under
* - ``error_limit = 1``
- Number of errors to find before a device is unmounted
* - ``log_address = /dev/log``
- Location where syslog sends the logs to
* - ``log_facility = LOG_LOCAL0``
- Syslog log facility
* - ``log_file_pattern = /var/log/kern.*[!.][!g][!z]``
- Location of the log file, with a globbing pattern, to check against
device errors; used to locate device blocks with errors in the log file
* - ``log_level = INFO``
- Logging level
* - ``log_max_line_length = 0``
- Caps the length of log lines to the value given; no limit if set to 0,
the default.
* - ``log_to_console = False``
- No help text available for this option.
* - ``minutes = 60``
- Number of minutes to look back in ``/var/log/kern.log``
* - ``recon_cache_path = /var/cache/swift``
- Directory where stats for a few items will be stored
* - ``regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b``
- No help text available for this option.
* - ``unmount_failed_device = True``
- No help text available for this option.
.. warning::
This script has only been tested on Ubuntu 10.04; use with caution on
other operating systems in production.
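To show how the ``regex_pattern_1`` and ``error_limit`` settings interact, the following sketch applies the default pattern from the table to a few log lines. It is not the ``swift-drive-audit`` script itself: the helper name and the sample kernel messages are invented, and treating a device as failed once its error count reaches the limit is an assumption.

```python
import re

# Default regex_pattern_1 value from the table above.
ERROR_RE = re.compile(r"\berror\b.*\b(dm-[0-9]{1,2}\d?)\b")
ERROR_LIMIT = 1  # mirrors the error_limit default


def devices_over_limit(log_lines, error_limit=ERROR_LIMIT):
    """Return devices whose error count reaches the limit (assumed rule)."""
    counts = {}
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            device = match.group(1)
            counts[device] = counts.get(device, 0) + 1
    return sorted(dev for dev, n in counts.items() if n >= error_limit)


# Hypothetical kernel log lines, for illustration only.
sample = [
    "kernel: Buffer I/O error on device dm-3, logical block 0",
    "kernel: EXT4-fs (dm-4): mounted filesystem",
]
failed = devices_over_limit(sample)
```

Only ``dm-3`` is reported, because the second line mentions a device but contains no ``error`` keyword.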
Emergency recovery of ring builder files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Problem
-------
An emergency might prevent a successful backup from restoring the
cluster to operational status.
Solution
--------
You should always keep a backup of swift ring builder files. However, if
an emergency occurs, this procedure may assist in returning your cluster
to an operational state.
Using existing swift tools, there is no way to recover a builder file
from a ``ring.gz`` file. However, if you have knowledge of Python, it
is possible to construct a builder file that is pretty close to the one
you have lost.
.. warning::
This procedure is a last resort for emergency circumstances. It
requires knowledge of the swift Python code and may not succeed.
#. Load the ring and a new ringbuilder object in a Python REPL:
.. code-block:: python
>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')
#. Start copying the data we have in the ring into the builder:
.. code-block:: python

   >>> import math
   >>> partitions = len(ring._replica2part2dev_id[0])
   >>> replicas = len(ring._replica2part2dev_id)
   >>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
   >>> builder.devs = ring.devs
   >>> builder._replica2part2dev = ring._replica2part2dev_id
   >>> builder._last_part_moves_epoch = 0
   >>> from array import array
   >>> builder._last_part_moves = array('B', (0 for _ in range(partitions)))
   >>> builder._set_parts_wanted()
   >>> for d in builder._iter_devs():
   ...     d['parts'] = 0
   ...
   >>> for p2d in builder._replica2part2dev:
   ...     for dev_id in p2d:
   ...         builder.devs[dev_id]['parts'] += 1
   ...
This is the extent of the recoverable fields.
#. For ``min_part_hours``, either use the value that you set originally,
if you remember it, or choose a new one:
.. code-block:: python
>>> builder.change_min_part_hours(24) # or whatever you want it to be
#. Validate the builder. If this raises an exception, check your
previous code:
.. code-block:: python
>>> builder.validate()
#. After it validates, save the builder and create a new ``account.builder``:
.. code-block:: python

   >>> import pickle
   >>> with open('account.builder', 'wb') as f:
   ...     pickle.dump(builder.to_dict(), f, protocol=2)
   ...
   >>> exit()
#. You should now have a file called ``account.builder`` in the current
working directory. Run
:command:`swift-ring-builder account.builder write_ring` and compare the new
``account.ring.gz`` to the ``account.ring.gz`` that you started
from. They probably are not byte-for-byte identical, but if you load them
in a REPL and their ``_replica2part2dev_id`` and ``devs`` attributes are
the same (or nearly so), then you are in good shape.
#. Repeat the procedure for ``container.ring.gz`` and
``object.ring.gz``, and you might get usable builder files.

@ -93,6 +93,7 @@ Administrator Documentation
replication_network
logs
ops_runbook/index
admin/index
Object Storage v1 REST API Documentation
========================================