Paul Luse e52e8bc917 Add Storage Policy Documentation

Add overview and example information for using Storage Policies.

DocImpact
Implements: blueprint storage-policies
Change-Id: I6f11f7a1bdaa6f3defb3baa56a820050e5f727f1

2014-06-19 10:18:34 -07:00

Administrator's Guide

Defining Storage Policies

Defining your Storage Policies is very easy to do with Swift. It is important that the administrator understand the concepts behind Storage Policies before creating and using them, in order to get the most benefit out of the feature and, more importantly, to avoid having to make unnecessary changes once a set of policies has been deployed to a cluster.

It is highly recommended that the reader fully read and comprehend overview_policies before proceeding with administration of policies. Plan carefully, and experiment first on a non-production cluster to be certain that the desired configuration meets the needs of the users. See upgrade-policy before planning the upgrade of your existing deployment.

The following is a high-level view of the few steps it takes to configure policies once you have decided what you want to do:

Define your policies in /etc/swift/swift.conf

Create the corresponding object rings

Communicate the names of the Storage Policies to cluster users

For a specific example that takes you through these steps, please see policies_saio

Managing the Rings

You may build the storage rings on any server with the appropriate version of Swift installed. Once built or changed (rebalanced), you must distribute the rings to all the servers in the cluster. Storage rings contain information about all the Swift storage partitions and how they are distributed between the different nodes and disks.

Swift 1.6.0 is the last version to use a Python pickle format. Subsequent versions use a different serialization format. Rings generated by Swift versions 1.6.0 and earlier may be read by any version, but rings generated after 1.6.0 may only be read by Swift versions greater than 1.6.0. So when upgrading from version 1.6.0 or earlier to a version greater than 1.6.0, either upgrade Swift on your ring building server last after all Swift nodes have been successfully upgraded, or refrain from generating rings until all Swift nodes have been successfully upgraded.

If you need to downgrade from a version of swift greater than 1.6.0 to a version less than or equal to 1.6.0, first downgrade your ring-building server, generate new rings, push them out, then continue with the rest of the downgrade.

For more information see overview_ring.

Removing a device from the ring:

swift-ring-builder <builder-file> remove <ip_address>/<device_name>

Removing a server from the ring:

swift-ring-builder <builder-file> remove <ip_address>

Adding devices to the ring:

See ring-preparing

See what devices for a server are in the ring:

swift-ring-builder <builder-file> search <ip_address>

Once you are done with all changes to the ring, the changes need to be "committed":

swift-ring-builder <builder-file> rebalance

Once the new rings are built, they should be pushed out to all the servers in the cluster.

Optionally, if invoked as 'swift-ring-builder-safe', the directory containing the specified builder file will be locked (via a .lock file in the parent directory). This provides a basic safeguard that prevents multiple instances of swift-ring-builder (or other utilities that observe this lock) from attempting to write to or read the builder/ring files while operations are in progress. This can be useful in environments where ring management has been automated but the operator still needs to interact with the rings manually.

Scripting Ring Creation

You can create scripts to create the account and container rings and rebalance. Here's an example script for the Account ring. Use similar commands to create a make-container-ring.sh script on the proxy server node.

Create a script file called make-account-ring.sh on the proxy server node with the following content:
```
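#!/bin/bash
# Rebuild the account ring from scratch in /etc/swift.
cd /etc/swift || exit 1  # stop here rather than delete files elsewhere
# Remove any existing builder, ring, and backup files.
rm -f account.builder account.ring.gz backups/account.builder backups/account.ring.gz
# Create the builder: partition power 18, 3 replicas, min_part_hours 1.
swift-ring-builder account.builder create 18 3 1
# Add one device per account server, each in its own zone.
swift-ring-builder account.builder add z1-<account-server-1>:6002/sdb1 1
swift-ring-builder account.builder add z2-<account-server-2>:6002/sdb1 1
# Assign partitions to devices and write account.ring.gz.
swift-ring-builder account.builder rebalance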
```
You need to replace the values of <account-server-1>, <account-server-2>, etc. with the IP addresses of the account servers used in your setup. You can have as many account servers as you need. All account servers are assumed to be listening on port 6002 and to have a storage device called "sdb1" (this is a directory name created under /drives when we set up the account server). The "z1", "z2", etc. designate zones, and you can choose whether you put devices in the same or different zones.

Make the script file executable and run it to create the account ring file:
```
chmod +x make-account-ring.sh
sudo ./make-account-ring.sh
```
Copy the resulting ring file /etc/swift/account.ring.gz to all the account server nodes in your Swift environment, and put them in the /etc/swift directory on these nodes. Make sure that every time you change the account ring configuration, you copy the resulting ring file to all the account nodes.
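
The commands in the script above follow a regular pattern, so they can also be generated from a list of servers. The following is a minimal sketch and not part of Swift: the function name ring_commands and the example IP addresses are hypothetical. Like the script above, it assumes port 6002 and a device named sdb1 on every server. It only builds the command lines and does not run them.

```python
def ring_commands(servers, builder="account.builder", port=6002,
                  device="sdb1", part_power=18, replicas=3,
                  min_part_hours=1, weight=1):
    """Return swift-ring-builder command lines for one ring.

    Each server is placed in its own zone (z1, z2, ...), mirroring the
    example script; adjust the zone assignment to match your layout.
    """
    cmds = ["swift-ring-builder %s create %d %d %d"
            % (builder, part_power, replicas, min_part_hours)]
    for zone, ip in enumerate(servers, start=1):
        cmds.append("swift-ring-builder %s add z%d-%s:%d/%s %d"
                    % (builder, zone, ip, port, device, weight))
    cmds.append("swift-ring-builder %s rebalance" % builder)
    return cmds


# Hypothetical addresses; substitute your own account servers.
for line in ring_commands(["10.0.0.11", "10.0.0.12"]):
    print(line)
```

Passing a different builder name and port would produce the equivalent commands for the container ring.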

Handling System Updates

It is recommended that system updates and reboots be done one zone at a time. This allows the update to happen while the Swift cluster stays available and responsive to requests. It is also advisable, after updating a zone, to let it run for a while before updating the other zones to make sure the update doesn't have any adverse effects.

Handling Drive Failure

In the event that a drive has failed, the first step is to make sure the drive is unmounted. This will make it easier for Swift to work around the failure until it has been resolved. If the drive is going to be replaced immediately, then it is best to simply replace the drive, format it, remount it, and let replication fill it up.

If the drive can't be replaced immediately, then it is best to leave it unmounted, and remove the drive from the ring. This will allow all the replicas that were on that drive to be replicated elsewhere until the drive is replaced. Once the drive is replaced, it can be re-added to the ring.

Handling Server Failure

If a server is having hardware issues, it is a good idea to make sure the swift services are not running. This will allow Swift to work around the failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should only last a couple of hours, then it is probably best to let Swift work around the failure and get the machine fixed and back online. When the machine comes back online, replication will make sure that anything that was missed during the downtime gets updated.

If the server has more serious issues, then it is probably best to remove all of the server's devices from the ring. Once the server has been repaired and is back online, the server's devices can be added back into the ring. It is important that the devices are reformatted before putting them back into the ring, as they are likely to be responsible for a different set of partitions than before.

Detecting Failed Drives

It has been our experience that when a drive is about to fail, error messages will spew into /var/log/kern.log. There is a script called swift-drive-audit that can be run via cron to watch for bad drives. If errors are detected, it will unmount the bad drive, so that Swift can work around it. The script takes a configuration file with the following settings:

[drive-audit]

Option	Default	Description
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Log level
device_dir	/srv/node	Directory devices are mounted under
minutes	60	Number of minutes to look back in /var/log/kern.log
error_limit	1	Number of errors to find before a device is unmounted
log_file_pattern	/var/log/kern*	Location of the log file with globbing pattern to check against device errors
regex_pattern_X	(see below)	Regular expression patterns to be used to locate device blocks with errors in the log file

The default regex patterns used to locate device blocks with errors are \berror\b.*\b(sd[a-z]{1,2}\d?)\b and \b(sd[a-z]{1,2}\d?)\b.*\berror\b. You can override these defaults by providing new expressions using the format regex_pattern_X = regex_expression, where X is a number.
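
To see what the default patterns pick out of a log line, it can help to try them in isolation. The sketch below is illustrative only and is not the swift-drive-audit code: it writes the two default patterns with their backslash escapes, and the function name device_in_line and the sample kernel log lines are made up for the example.

```python
import re

# The two default patterns, with the backslash escapes written out.
DEFAULT_PATTERNS = [
    re.compile(r"\berror\b.*\b(sd[a-z]{1,2}\d?)\b"),
    re.compile(r"\b(sd[a-z]{1,2}\d?)\b.*\berror\b"),
]


def device_in_line(line, patterns=DEFAULT_PATTERNS):
    """Return the first device name a pattern captures, or None."""
    for pattern in patterns:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical log lines, for illustration only.
samples = [
    "kernel: end_request: I/O error, dev sdb, sector 1234",
    "kernel: sd 2:0:0:0: [sdc] Unhandled error code",
    "kernel: eth0: link up",
]
for sample in samples:
    print(device_in_line(sample), "<-", sample)
```

The first pattern covers lines where the word "error" precedes the device name, the second covers lines where the device name comes first. A custom regex_pattern_X can be checked the same way before it goes into the configuration file.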

This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are using a different distro or OS, some care should be taken before using in production.

Cluster Health

There is a swift-dispersion-report tool for measuring overall cluster health. This is accomplished by checking if a set of deliberately distributed containers and objects are currently in their proper places within the cluster.

For instance, a common deployment has three replicas of each object. The health of that object can be measured by checking if each replica is in its proper place. If only 2 of the 3 are in place, the object's health can be said to be at 66.66%, where 100% would be perfect.
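
The arithmetic behind this health figure is simply the share of expected copies that were found. As a minimal sketch (the function name replica_health is hypothetical and not part of Swift):

```python
def replica_health(copies_found, copies_expected):
    """Return the percentage of expected copies that are in place."""
    if copies_expected <= 0:
        raise ValueError("copies_expected must be positive")
    return 100.0 * copies_found / copies_expected


# Two of three replicas in place.
print("%.2f%%" % replica_health(2, 3))
```

With rounding to two decimals this prints 66.67%; the figure quoted above is the same value truncated rather than rounded.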

A single object's health, especially an older object, usually reflects the health of that entire partition the object is in. If we make enough objects on a distinct percentage of the partitions in the cluster, we can get a pretty valid estimate of the overall cluster health. In practice, about 1% partition coverage seems to balance well between accuracy and the amount of time it takes to gather results.

The first thing that needs to be done to provide this health value is create a new account solely for this usage. Next, we need to place the containers and objects throughout the system so that they are on distinct partitions. The swift-dispersion-populate tool does this by making up random container and object names until they fall on distinct partitions. Last, and repeatedly for the life of the cluster, we need to run the swift-dispersion-report tool to check the health of each of these containers and objects.

These tools need direct access to the entire cluster and to the ring files (installing them on a proxy server will probably do). Both swift-dispersion-populate and swift-dispersion-report use the same configuration file, /etc/swift/dispersion.conf. Example conf file:

[dispersion]
auth_url = http://localhost:8080/auth/v1.0
auth_user = test:tester
auth_key = testing
endpoint_type = internalURL

There are also options for the conf file for specifying the dispersion coverage (defaults to 1%), retries, concurrency, etc. though usually the defaults are fine.

Once the configuration is in place, run swift-dispersion-populate to populate the containers and objects throughout the cluster.

Now that those containers and objects are in place, you can run swift-dispersion-report to get a dispersion report, or the overall health of the cluster. Here is an example of a cluster in perfect health:

$ swift-dispersion-report
Queried 2621 containers for dispersion reporting, 19s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space

Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space

Now I'll deliberately double the weight of a device in the object ring (with replication turned off) and rerun the dispersion report to show what impact that has:

$ swift-ring-builder object.builder set_weight d0 200
$ swift-ring-builder object.builder rebalance
...
$ swift-dispersion-report
Queried 2621 containers for dispersion reporting, 8s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space

Queried 2619 objects for dispersion reporting, 7s, 0 retries
There were 1763 partitions missing one copy.
77.56% of object copies found (6094 of 7857)
Sample represents 1.00% of the object partition space

You can see the health of the objects in the cluster has gone down significantly. Of course, I only have four devices in this test environment; in a production environment with many, many devices, the impact of one device change is much less. Next, I'll run the replicators to get everything put back into place and then rerun the dispersion report:

... start object replicators and monitor logs until they're caught up ...
$ swift-dispersion-report
Queried 2621 containers for dispersion reporting, 17s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space

Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space

You can also run the report for only containers or objects:

$ swift-dispersion-report --container-only
Queried 2621 containers for dispersion reporting, 17s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space

$ swift-dispersion-report --object-only
Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space

Alternatively, the dispersion report can also be output in json format. This allows it to be more easily consumed by third party utilities:

$ swift-dispersion-report -j
{"object": {"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0}, "container": {"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected": 12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}
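
As one example of consuming the JSON output, the sketch below flags any section whose pct_found falls below a threshold. It is a hedged illustration: the function name low_dispersion, the threshold, and the trimmed sample document are assumptions made for the example. Only the pct_found field is read, and it appears in the output shown above.

```python
import json


def low_dispersion(report_json, threshold=100.0):
    """Return {section: pct_found} for sections below the threshold."""
    report = json.loads(report_json)
    return {
        name: stats["pct_found"]
        for name, stats in report.items()
        if stats["pct_found"] < threshold
    }


# Trimmed, made-up sample in the same shape as the report output.
sample = json.dumps({
    "object": {"copies_found": 6094, "copies_expected": 7857,
               "pct_found": 77.56},
    "container": {"copies_found": 7863, "copies_expected": 7863,
                  "pct_found": 100.0},
})
print(low_dispersion(sample))
```

In practice the input would be the captured output of swift-dispersion-report -j, and a monitoring check could alert whenever the returned dictionary is not empty.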

Geographically Distributed Clusters

Swift's default configuration is currently designed to work in a single region, where a region is defined as a group of machines with high-bandwidth, low-latency links between them. However, configuration options exist that make running a performant multi-region Swift cluster possible.

For the rest of this section, we will assume a two-region Swift cluster: region 1 in San Francisco (SF), and region 2 in New York (NY). Each region shall contain within it 3 zones, numbered 1, 2, and 3, for a total of 6 zones.

read_affinity

This setting makes the proxy server prefer local backend servers for GET and HEAD requests over non-local ones. For example, it is preferable for an SF proxy server to service object GET requests by talking to SF object servers, as the client will receive lower latency and higher throughput.

By default, Swift randomly chooses one of the three replicas to give to the client, thereby spreading the load evenly. In the case of a geographically-distributed cluster, the administrator is likely to prioritize keeping traffic local over even distribution of results. This is where the read_affinity setting comes in.

Example:

[app:proxy-server]
read_affinity = r1=100

This will make the proxy attempt to service GET and HEAD requests from backends in region 1 before contacting any backends in region 2. However, if no region 1 backends are available (due to replica placement, failed hardware, or other reasons), then the proxy will fall back to backend servers in other regions.

Example:

[app:proxy-server]
read_affinity = r1z1=100, r1=200

This will make the proxy attempt to service GET and HEAD requests from backends in region 1 zone 1, then backends in region 1, then any other backends. If a proxy is physically close to a particular zone or zones, this can provide bandwidth savings. For example, if a zone corresponds to servers in a particular rack, and the proxy server is in that same rack, then setting read_affinity to prefer reads from within the rack will result in less traffic between the top-of-rack switches.

The read_affinity setting may contain any number of region/zone specifiers; the priority number (after the equals sign) determines the ordering in which backend servers will be contacted. A lower number means higher priority.

Note that read_affinity only affects the ordering of primary nodes (see ring docs for definition of primary node), not the ordering of handoff nodes.
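
The ordering rule for read_affinity can be sketched in a few lines. This is a simplified illustration of the behavior described above, not Swift's own implementation: the function names and the node dictionaries (with region, zone, and name keys) are assumptions made for the example.

```python
import re

NO_MATCH = float("inf")  # nodes matching no specifier sort last


def parse_read_affinity(value):
    """Parse e.g. "r1z1=100, r1=200" into (priority, region, zone) tuples."""
    rules = []
    for piece in value.split(","):
        piece = piece.strip()
        if not piece:
            continue
        match = re.match(r"r(\d+)(?:z(\d+))?=(\d+)$", piece)
        if not match:
            raise ValueError("invalid affinity specifier: %r" % piece)
        region, zone, priority = match.groups()
        rules.append((int(priority), int(region),
                      None if zone is None else int(zone)))
    # Lower priority number first; sort on the number only.
    return sorted(rules, key=lambda rule: rule[0])


def affinity_key(rules):
    """Build a sort key: the priority of the first rule a node matches."""
    def key(node):
        for priority, region, zone in rules:
            if node["region"] == region and zone in (None, node["zone"]):
                return priority
        return NO_MATCH
    return key


# Hypothetical primary nodes for one object.
nodes = [
    {"name": "ny", "region": 2, "zone": 1},
    {"name": "sf-z2", "region": 1, "zone": 2},
    {"name": "sf-z1", "region": 1, "zone": 1},
]
rules = parse_read_affinity("r1z1=100, r1=200")
ordered = sorted(nodes, key=affinity_key(rules))
print([node["name"] for node in ordered])
```

With the setting r1z1=100, r1=200, the node in region 1 zone 1 sorts first, other region 1 nodes next, and nodes in any other region last.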

write_affinity and write_affinity_node_count

This setting makes the proxy server prefer local backend servers for object PUT requests over non-local ones. For example, it may be preferable for an SF proxy server to service object PUT requests by talking to SF object servers, as the client will receive lower latency and higher throughput. However, if this setting is used, note that a NY proxy server handling a GET request for an object that was PUT using write affinity may have to fetch it across the WAN link, as the object won't immediately have any replicas in NY. Replication will eventually move the object's replicas to their proper homes in both SF and NY.

Note that only object PUT requests are affected by the write_affinity setting; POST, GET, HEAD, DELETE, OPTIONS, and account/container PUT requests are not affected.

This setting lets you trade data distribution for throughput. If write_affinity is enabled, then object replicas will initially be stored all within a particular region or zone, thereby decreasing the quality of the data distribution, but the initial writes avoid the slower WAN links, giving higher throughput to clients. Note that the replicators will eventually move objects to their proper, well-distributed homes.

The write_affinity setting is useful only when you don't typically read objects immediately after writing them. For example, consider a workload of mainly backups: if you have a bunch of machines in NY that periodically write backups to Swift, then odds are that you don't then immediately read those backups in SF. If your workload doesn't look like that, then you probably shouldn't use write_affinity.

The write_affinity_node_count setting is only useful in conjunction with write_affinity; it governs how many local object servers will be tried before falling back to non-local ones.

Example:

[app:proxy-server]
write_affinity = r1
write_affinity_node_count = 2 * replicas

Assuming 3 replicas, this configuration will make object PUTs try storing the object's replicas on up to 6 disks ("2 * replicas") in region 1 ("r1").
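
The arithmetic in this example can be written out as a small helper. This is a sketch under stated assumptions, not Swift's parsing code: the function name affinity_node_count is hypothetical, and it assumes the value is either a plain integer or has the form "N * replicas" shown above.

```python
import re


def affinity_node_count(value, replicas):
    """Evaluate "N" or "N * replicas" for a ring with `replicas` replicas."""
    match = re.match(r"\s*(\d+)\s*(\*\s*replicas)?\s*$", value)
    if not match:
        raise ValueError("unsupported value: %r" % value)
    count = int(match.group(1))
    # Multiply only when the "* replicas" suffix is present.
    return count * replicas if match.group(2) else count


print(affinity_node_count("2 * replicas", 3))
```

For a 3-replica ring, "2 * replicas" evaluates to 6, which matches the "up to 6 disks" described above.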

You should be aware that, if you have data coming into SF faster than your link to NY can transfer it, then your cluster's data distribution will get worse and worse over time as objects pile up in SF. If this happens, it is recommended to disable write_affinity and simply let object PUTs traverse the WAN link, as that will naturally limit the object growth rate to what your WAN link can handle.

Cluster Telemetry and Monitoring

Various metrics and telemetry can be obtained from the account, container, and object servers using the recon server middleware and the swift-recon cli. To do so, update your account, container, or object server pipelines to include recon and add the associated filter config.

object-server.conf sample:

[pipeline:main]
pipeline = recon object-server

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

container-server.conf sample:

[pipeline:main]
pipeline = recon container-server

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

account-server.conf sample:

[pipeline:main]
pipeline = recon account-server

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

The recon_cache_path simply sets the directory where stats for a few items will be stored. Depending on the method of deployment you may need to create this directory manually and ensure that swift has read/write access.

Finally, if you also wish to track asynchronous pendings on your object servers, you will need to set up a cron job to run the swift-recon-cron script periodically on your object servers:

*/5 * * * * swift /usr/bin/swift-recon-cron /etc/swift/object-server.conf

Once the recon middleware is enabled, a GET request for "/recon/<metric>" to the backend object server will return a JSON-formatted response:

fhines@ubuntu:~$ curl -i http://localhost:6030/recon/async
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 20
Date: Tue, 18 Oct 2011 21:03:01 GMT

{"async_pending": 0}

Note that the default port for the object server is 6000, except on a Swift All-In-One installation, which uses 6010, 6020, 6030, and 6040.
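
The same request can be made from a script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the host, port, and timeout defaults are assumptions to adjust for your deployment. The URL layout and the async_pending key come from the curl example above.

```python
import json
from urllib.request import urlopen


def recon_url(host, port, metric):
    """Build the URL for a recon metric on a backend server."""
    return "http://%s:%d/recon/%s" % (host, port, metric)


def parse_async_pending(body):
    """Extract the async pending count from a /recon/async response body."""
    return json.loads(body)["async_pending"]


def fetch_async_pending(host="localhost", port=6000, timeout=5):
    """Query a live object server; requires the recon middleware."""
    with urlopen(recon_url(host, port, "async"), timeout=timeout) as resp:
        return parse_async_pending(resp.read().decode("utf-8"))


# Offline demonstration using the response body shown above.
print(recon_url("localhost", 6030, "async"))
print(parse_async_pending('{"async_pending": 0}'))
```

Calling fetch_async_pending() against a running object server would return the same integer that the curl example shows in its JSON body.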

The following metrics and telemetry are currently exposed:

Request URI	Description
/recon/load	returns 1, 5, and 15 minute load averages
/recon/mem	returns /proc/meminfo
/recon/mounted	returns ALL currently mounted filesystems
/recon/unmounted	returns all unmounted drives if mount_check = True
/recon/diskusage	returns disk utilization for storage devices
/recon/ringmd5	returns object/container/account ring md5sums
/recon/quarantined	returns # of quarantined objects/accounts/containers
/recon/sockstat	returns consumable info from /proc/net/sockstat|6
/recon/devices	returns list of devices and devices dir i.e. /srv/node
/recon/async	returns count of async pending
/recon/replication	returns object replication times (for backward compatibility)
/recon/replication/<type>	returns replication info for given type (account, container, object)
/recon/auditor/<type>	returns auditor stats on last reported scan for given type (account, container, object)
/recon/updater/<type>	returns last updater sweep times for given type (container, object)

This information can also be queried via the swift-recon command line utility:

fhines@ubuntu:~$ swift-recon -h
Usage:
        usage: swift-recon <server_type> [-v] [--suppress] [-a] [-r] [-u] [-d]
        [-l] [--md5] [--auditor] [--updater] [--expirer] [--sockstat]

        <server_type>   account|container|object
        Defaults to object server.

        ex: swift-recon container -l --auditor


Options:
  -h, --help            show this help message and exit
  -v, --verbose         Print verbose info
  --suppress            Suppress most connection related errors
  -a, --async           Get async stats
  -r, --replication     Get replication stats
  --auditor             Get auditor stats
  --updater             Get updater stats
  --expirer             Get expirer stats
  -u, --unmounted       Check cluster for unmounted devices
  -d, --diskusage       Get disk usage stats
  -l, --loadstats       Get cluster load average stats
  -q, --quarantined     Get cluster quarantine stats
  --md5                 Get md5sum of servers ring and compare to local copy
  --sockstat            Get cluster socket usage stats
  --all                 Perform all checks. Equal to -arudlq --md5 --sockstat
  -z ZONE, --zone=ZONE  Only query servers in specified zone
  -t SECONDS, --timeout=SECONDS
                        Time to wait for a response from a server
  --swiftdir=SWIFTDIR   Default = /etc/swift

For example, to obtain container replication info from all hosts in zone "3":

fhines@ubuntu:~$ swift-recon container -r --zone 3
===============================================================================
--> Starting reconnaissance on 1 hosts
===============================================================================
[2012-04-02 02:45:48] Checking on replication
[failure] low: 0.000, high: 0.000, avg: 0.000, reported: 1
[success] low: 486.000, high: 486.000, avg: 486.000, reported: 1
[replication_time] low: 20.853, high: 20.853, avg: 20.853, reported: 1
[attempted] low: 243.000, high: 243.000, avg: 243.000, reported: 1

Reporting Metrics to StatsD

If you have a StatsD server running, Swift may be configured to send it real-time operational metrics. To enable this, set the following configuration entries (see the sample configuration files):

log_statsd_host = localhost
log_statsd_port = 8125
log_statsd_default_sample_rate = 1.0
log_statsd_sample_rate_factor = 1.0
log_statsd_metric_prefix =                [empty-string]

If log_statsd_host is not set, this feature is disabled. The default values for the other settings are given above.

The sample rate is a real number between 0 and 1 which defines the probability of sending a sample for any given event or timing measurement. This sample rate is sent with each sample to StatsD and used to multiply the value. For example, with a sample rate of 0.5, StatsD will multiply that counter's value by 2 when flushing the metric to an upstream monitoring system (Graphite, Ganglia, etc.).
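
To make the scaling concrete, the sketch below formats a counter increment in the common StatsD line format, where the sample rate travels with the sample as an "@rate" suffix, and shows how a receiver scales the count back up. It is an illustration, not Swift's logging code; both function names are hypothetical.

```python
def counter_line(name, sample_rate=1.0):
    """Format a StatsD counter increment, e.g. "a.b:1|c|@0.5"."""
    line = "%s:1|c" % name
    if sample_rate < 1:
        # The rate is attached so the receiver can compensate for sampling.
        line += "|@%s" % sample_rate
    return line


def estimated_events(samples_received, sample_rate):
    """Scale received samples back up by 1 / sample_rate."""
    return samples_received / sample_rate


print(counter_line("proxy-server.errors", 0.5))
print(estimated_events(10, 0.5))
```

At a sample rate of 0.5, ten received samples are reported as an estimated twenty events, which is the multiplication by 2 described above.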

Some relatively high-frequency metrics have a default sample rate less than one. If you want to override the default sample rate for all metrics whose default sample rate is not specified in the Swift source, you may set log_statsd_default_sample_rate to a value less than one. This is NOT recommended (see next paragraph). A better way to reduce StatsD load is to adjust log_statsd_sample_rate_factor to a value less than one. The log_statsd_sample_rate_factor is multiplied by any sample rate (either the global default or one specified by the actual metric logging call in the Swift source) prior to handling. In other words, this one tunable can lower the frequency of all StatsD logging by a proportional amount.

To get the best data, start with the default log_statsd_default_sample_rate and log_statsd_sample_rate_factor values of 1 and only lower log_statsd_sample_rate_factor if needed. The log_statsd_default_sample_rate should not be used and remains for backward compatibility only.
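
The interaction between the two settings can be summarized as follows. This is a sketch of the rule as described above; the function name and argument names are hypothetical.

```python
def effective_sample_rate(metric_rate=None, default_rate=1.0, factor=1.0):
    """Combine a metric's own rate (or the default) with the global factor.

    metric_rate:  rate given by the logging call, or None if unspecified
    default_rate: log_statsd_default_sample_rate
    factor:       log_statsd_sample_rate_factor
    """
    rate = default_rate if metric_rate is None else metric_rate
    return rate * factor


# A metric with no rate of its own, and one logged at 0.1, with factor 0.5.
print(effective_sample_rate(None, 1.0, 0.5))
print(effective_sample_rate(0.1, 1.0, 0.5))
```

Lowering the factor scales every metric proportionally, including those that already specify a reduced rate, whereas lowering the default only affects metrics that specify none.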

The metric prefix will be prepended to every metric sent to the StatsD server. For example, with:

log_statsd_metric_prefix = proxy01

the metric proxy-server.errors would be sent to StatsD as proxy01.proxy-server.errors. This is useful for differentiating different servers when sending statistics to a central StatsD server. If you run a local StatsD server per node, you could configure a per-node metrics prefix there and leave log_statsd_metric_prefix blank.

Note that metrics reported to StatsD are counters or timing data (which are sent in units of milliseconds). StatsD usually expands timing data out to min, max, avg, count, and 90th percentile per timing metric, but the details of this behavior will depend on the configuration of your StatsD server. Some important "gauge" metrics may still need to be collected using another method. For example, the object-server.async_pendings StatsD metric counts the generation of async_pendings in real-time, but will not tell you the current number of async_pending container updates on disk at any point in time.

Note also that the set of metrics collected, their names, and their semantics are not locked down and will change over time. StatsD logging is currently in a "beta" stage and will continue to evolve.

Metrics for `account-auditor`:

Metric Name	Description
account-auditor.errors	Count of audit runs (across all account databases) which caught an Exception.
account-auditor.passes	Count of individual account databases which passed audit.
account-auditor.failures	Count of individual account databases which failed audit.
account-auditor.timing	Timing data for individual account database audits.

Metrics for `account-reaper`:

Metric Name	Description
account-reaper.errors	Count of devices failing the mount check.
account-reaper.timing	Timing data for each reap_account() call.
account-reaper.return_codes.X	Count of HTTP return codes from various operations (e.g. object listing, container deletion, etc.). The value for X is the first digit of the return code (2 for 201, 4 for 404, etc.).
account-reaper.containers_failures	Count of failures to delete a container.
account-reaper.containers_deleted	Count of containers successfully deleted.
account-reaper.containers_remaining	Count of containers which failed to delete with zero successes.
account-reaper.containers_possibly_remaining	Count of containers which failed to delete with at least one success.
account-reaper.objects_failures	Count of failures to delete an object.
account-reaper.objects_deleted	Count of objects successfully deleted.
account-reaper.objects_remaining	Count of objects which failed to delete with zero successes.
account-reaper.objects_possibly_remaining	Count of objects which failed to delete with at least one success.

Metrics for account-server ("Not Found" is not considered an error and requests which increment errors are not included in the timing data):

Metric Name	Description
account-server.DELETE.errors.timing	Timing data for each DELETE request resulting in an error: bad request, not mounted, missing timestamp.
account-server.DELETE.timing	Timing data for each DELETE request not resulting in an error.
account-server.PUT.errors.timing	Timing data for each PUT request resulting in an error: bad request, not mounted, conflict, recently-deleted.
account-server.PUT.timing	Timing data for each PUT request not resulting in an error.
account-server.HEAD.errors.timing	Timing data for each HEAD request resulting in an error: bad request, not mounted.
account-server.HEAD.timing	Timing data for each HEAD request not resulting in an error.
account-server.GET.errors.timing	Timing data for each GET request resulting in an error: bad request, not mounted, bad delimiter, account listing limit too high, bad accept header.
account-server.GET.timing	Timing data for each GET request not resulting in an error.
account-server.REPLICATE.errors.timing	Timing data for each REPLICATE request resulting in an error: bad request, not mounted.
account-server.REPLICATE.timing	Timing data for each REPLICATE request not resulting in an error.
account-server.POST.errors.timing	Timing data for each POST request resulting in an error: bad request, bad or missing timestamp, not mounted.
account-server.POST.timing	Timing data for each POST request not resulting in an error.

Metrics for `account-replicator`:

Metric Name	Description
account-replicator.diffs	Count of syncs handled by sending differing rows.
account-replicator.diff_caps	Count of "diffs" operations which failed because "max_diffs" was hit.
account-replicator.no_changes	Count of accounts found to be in sync.
account-replicator.hashmatches	Count of accounts found to be in sync via hash comparison (broker.merge_syncs was called).
account-replicator.rsyncs	Count of completely missing accounts which were sent via rsync.
account-replicator.remote_merges	Count of syncs handled by sending entire database via rsync.
account-replicator.attempts	Count of database replication attempts.
account-replicator.failures	Count of database replication attempts which failed due to corruption (quarantined) or inability to read as well as attempts to individual nodes which failed.
account-replicator.removes.<device>	Count of databases on <device> deleted because the delete_timestamp was greater than the put_timestamp and the database had no rows or because it was successfully sync'ed to other locations and doesn't belong here anymore.
account-replicator.successes	Count of replication attempts to an individual node which were successful.
account-replicator.timing	Timing data for each database replication attempt not resulting in a failure.

Metrics for `container-auditor`:

Metric Name	Description
container-auditor.errors	Incremented when an Exception is caught in an audit pass (only once per pass, max).
container-auditor.passes	Count of individual containers passing an audit.
container-auditor.failures	Count of individual containers failing an audit.
container-auditor.timing	Timing data for each container audit.

Metrics for `container-replicator`:

Metric Name	Description
container-replicator.diffs	Count of syncs handled by sending differing rows.
container-replicator.diff_caps	Count of "diffs" operations which failed because "max_diffs" was hit.
container-replicator.no_changes	Count of containers found to be in sync.
container-replicator.hashmatches	Count of containers found to be in sync via hash comparison (broker.merge_syncs was called).
container-replicator.rsyncs	Count of completely missing containers which were sent via rsync.
container-replicator.remote_merges	Count of syncs handled by sending entire database via rsync.
container-replicator.attempts	Count of database replication attempts.
container-replicator.failures	Count of database replication attempts which failed due to corruption (quarantined) or inability to read as well as attempts to individual nodes which failed.
container-replicator.removes.<device>	Count of databases deleted on <device> because the delete_timestamp was greater than the put_timestamp and the database had no rows or because it was successfully sync'ed to other locations and doesn't belong here anymore.
container-replicator.successes	Count of replication attempts to an individual node which were successful.
container-replicator.timing	Timing data for each database replication attempt not resulting in a failure.

Metrics for container-server ("Not Found" is not considered an error and requests which increment errors are not included in the timing data):

Metric Name	Description
container-server.DELETE.errors.timing	Timing data for DELETE request errors: bad request, not mounted, missing timestamp, conflict.
container-server.DELETE.timing	Timing data for each DELETE request not resulting in an error.
container-server.PUT.errors.timing	Timing data for PUT request errors: bad request, missing timestamp, not mounted, conflict.
container-server.PUT.timing	Timing data for each PUT request not resulting in an error.
container-server.HEAD.errors.timing	Timing data for HEAD request errors: bad request, not mounted.
container-server.HEAD.timing	Timing data for each HEAD request not resulting in an error.
container-server.GET.errors.timing	Timing data for GET request errors: bad request, not mounted, parameters not utf8, bad accept header.
container-server.GET.timing	Timing data for each GET request not resulting in an error.
container-server.REPLICATE.errors.timing	Timing data for REPLICATE request errors: bad request, not mounted.
container-server.REPLICATE.timing	Timing data for each REPLICATE request not resulting in an error.
container-server.POST.errors.timing	Timing data for POST request errors: bad request, bad x-container-sync-to, not mounted.
container-server.POST.timing	Timing data for each POST request not resulting in an error.

Metrics for `container-sync`:

Metric Name	Description
container-sync.skips	Count of containers skipped because they don't have sync'ing enabled.
container-sync.failures	Count of failures while sync'ing individual containers.
container-sync.syncs	Count of individual containers sync'ed successfully.
container-sync.deletes	Count of container database rows sync'ed by deletion.
container-sync.deletes.timing	Timing data for each container database row synchronization via deletion.
container-sync.puts	Count of container database rows sync'ed by PUTing.
container-sync.puts.timing	Timing data for each container database row synchronization via PUTing.

Metrics for `container-updater`:

Metric Name	Description
container-updater.successes	Count of containers which successfully updated their account.
container-updater.failures	Count of containers which failed to update their account.
container-updater.no_changes	Count of containers which didn't need to update their account.
container-updater.timing	Timing data for processing a container; only includes timing for containers which needed to update their accounts (i.e. "successes" and "failures" but not "no_changes").

Metrics for `object-auditor`:

Metric Name	Description
object-auditor.quarantines	Count of objects failing audit and quarantined.
object-auditor.errors	Count of errors encountered while auditing objects.
object-auditor.timing	Timing data for each object audit (does not include any rate-limiting sleep time for max_files_per_second, but does include rate-limiting sleep time for max_bytes_per_second).

Metrics for `object-expirer`:

Metric Name	Description
object-expirer.objects	Count of objects expired.
object-expirer.errors	Count of errors encountered while attempting to expire an object.
object-expirer.timing	Timing data for each object expiration attempt, including ones resulting in an error.

Metrics for `object-replicator`:

Metric Name	Description
object-replicator.partition.delete.count.<device>	A count of partitions on <device> which were replicated to another node because they didn't belong on this node. This metric is tracked per-device to allow for "quiescence detection" for object replication activity on each device.
object-replicator.partition.delete.timing	Timing data for partitions replicated to another node because they didn't belong on this node. This metric is not tracked per device.
object-replicator.partition.update.count.<device>	A count of partitions on <device> which were replicated to another node, but also belong on this node. As with delete.count, this metric is tracked per-device.
object-replicator.partition.update.timing	Timing data for partitions replicated which also belong on this node. This metric is not tracked per-device.
object-replicator.suffix.hashes	Count of suffix directories whose hash (of filenames) was recalculated.
object-replicator.suffix.syncs	Count of suffix directories replicated with rsync.

Metrics for `object-server`:

Metric Name	Description
object-server.quarantines	Count of objects (files) found bad and moved to quarantine.
object-server.async_pendings	Count of container updates saved as async_pendings (may result from PUT or DELETE requests).
object-server.POST.errors.timing	Timing data for POST request errors: bad request, missing timestamp, delete-at in past, not mounted.
object-server.POST.timing	Timing data for each POST request not resulting in an error.
object-server.PUT.errors.timing	Timing data for PUT request errors: bad request, not mounted, missing timestamp, object creation constraint violation, delete-at in past.
object-server.PUT.timeouts	Count of object PUTs which exceeded max_upload_time.
object-server.PUT.timing	Timing data for each PUT request not resulting in an error.
object-server.PUT.<device>.timing	Timing data per kB transferred (ms/kB) for each non-zero-byte PUT request on each device. Useful for monitoring problematic devices; higher is bad.
object-server.GET.errors.timing	Timing data for GET request errors: bad request, not mounted, header timestamps before the epoch, precondition failed. File errors resulting in a quarantine are not counted here.
object-server.GET.timing	Timing data for each GET request not resulting in an error. Includes requests which couldn't find the object (including disk errors resulting in file quarantine).
object-server.HEAD.errors.timing	Timing data for HEAD request errors: bad request, not mounted.
object-server.HEAD.timing	Timing data for each HEAD request not resulting in an error. Includes requests which couldn't find the object (including disk errors resulting in file quarantine).
object-server.DELETE.errors.timing	Timing data for DELETE request errors: bad request, missing timestamp, not mounted, precondition failed. Includes requests which couldn't find or match the object.
object-server.DELETE.timing	Timing data for each DELETE request not resulting in an error.
object-server.REPLICATE.errors.timing	Timing data for REPLICATE request errors: bad request, not mounted.
object-server.REPLICATE.timing	Timing data for each REPLICATE request not resulting in an error.

Metrics for `object-updater`:

Metric Name	Description
object-updater.errors	Count of drives not mounted or async_pending files with an unexpected name.
object-updater.timing	Timing data for object sweeps to flush async_pending container updates. Does not include object sweeps which did not find an existing async_pending storage directory.
object-updater.quarantines	Count of async_pending container updates which were corrupted and moved to quarantine.
object-updater.successes	Count of successful container updates.
object-updater.failures	Count of failed container updates.
object-updater.unlinks	Count of async_pending files unlinked. An async_pending file is unlinked either when it is successfully processed or when the replicator sees that there is a newer async_pending file for the same object.

Metrics for proxy-server (in the table, <type> is the proxy-server controller responsible for the request and will be one of "account", "container", or "object"):

Metric Name	Description
proxy-server.errors	Count of errors encountered while serving requests before the controller type is determined. Includes invalid Content-Length, errors finding the internal controller to handle the request, invalid utf8, and bad URLs.
proxy-server.<type>.handoff_count	Count of node hand-offs; only tracked if log_handoffs is set in the proxy-server config.
proxy-server.<type>.handoff_all_count	Count of times only hand-off locations were utilized; only tracked if log_handoffs is set in the proxy-server config.
proxy-server.<type>.client_timeouts	Count of client timeouts (client did not read within client_timeout seconds during a GET or did not supply data within client_timeout seconds during a PUT).
proxy-server.<type>.client_disconnects	Count of detected client disconnects during PUT operations (does NOT include caught Exceptions in the proxy-server which caused a client disconnect).

Metrics for proxy-logging middleware (in the table, <type> is either the proxy-server controller responsible for the request ("account", "container", or "object") or the string "SOS" if the request came from the Swift Origin Server middleware; the <verb> portion will be one of "GET", "HEAD", "POST", "PUT", "DELETE", "COPY", "OPTIONS", or "BAD_METHOD"; the list of valid HTTP methods is configurable via the log_statsd_valid_http_methods config variable, and the default setting yields the above behavior):

Metric Name	Description
proxy-server.<type>.<verb>.<status>.timing	Timing data for requests, start to finish. The <status> portion is the numeric HTTP status code for the request (e.g. "200" or "404").
proxy-server.<type>.GET.<status>.first-byte.timing	Timing data up to completion of sending the response headers (only for GET requests). <status> and <type> are as for the main timing metric.
proxy-server.<type>.<verb>.<status>.xfer	This counter metric is the sum of bytes transferred in (from clients) and out (to clients) for requests. The <type>, <verb>, and <status> portions of the metric are just like the main timing metric.
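When consuming these metrics downstream, it can help to split a metric name back into its <type>, <verb>, and <status> portions. The following is only a minimal sketch: the function name is hypothetical, and it assumes the metric name arrives exactly in the form shown in the table, with nothing prepended to it.

```python
def parse_proxy_timing_metric(name):
    """Split 'proxy-server.<type>.<verb>.<status>.timing' into its parts.

    Returns a dict, or None if the name does not have that exact shape
    (for example, the first-byte timing metric has an extra component).
    """
    parts = name.split(".")
    if len(parts) != 5 or parts[0] != "proxy-server" or parts[4] != "timing":
        return None
    _, req_type, verb, status, _ = parts
    if not status.isdigit():
        return None
    return {"type": req_type, "verb": verb, "status": int(status)}


parsed = parse_proxy_timing_metric("proxy-server.object.GET.200.timing")
print(parsed)
```

Names that do not match, such as the first-byte timing metric or the xfer counter, return None here and would need their own handling.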

Metrics for tempauth middleware (in the table, <reseller_prefix> represents the actual configured reseller_prefix or "NONE" if the reseller_prefix is the empty string):

Metric Name	Description
tempauth.<reseller_prefix>.unauthorized	Count of regular requests which were denied with HTTPUnauthorized.
tempauth.<reseller_prefix>.forbidden	Count of regular requests which were denied with HTTPForbidden.
tempauth.<reseller_prefix>.token_denied	Count of token requests which were denied.
tempauth.<reseller_prefix>.errors	Count of errors.

Debugging Tips and Tools

When a request is made to Swift, it is given a unique transaction id. This id should be in every log line that has to do with that request. This can be useful when looking at all the services that are hit by a single request.
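As a rough illustration of following one request across services, the sketch below filters log lines by transaction id. The function name, the sample log lines, and the "tx0001" style ids are all made up for the example; real Swift log formats depend on your logging configuration.

```python
def lines_for_transaction(log_lines, txn_id):
    """Return the log lines that mention the given transaction id."""
    return [line for line in log_lines if txn_id in line]


# Hypothetical log lines; real Swift log formats vary with configuration.
sample = [
    "proxy-server: GET /v1/AUTH_test/c/o 200 txn=tx0001",
    "object-server: GET /sda/123/AUTH_test/c/o 200 txn=tx0001",
    "proxy-server: PUT /v1/AUTH_test/c/p 201 txn=tx0002",
]
matches = lines_for_transaction(sample, "tx0001")
for line in matches:
    print(line)
```

In practice the same idea is often applied with grep over the collected logs from all nodes.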

If you need to know where a specific account, container or object is in the cluster, swift-get-nodes will show the location where each replica should be.

If you are looking at an object on the server and need more info, swift-object-info will display the account, container, replica locations and metadata of the object.

If you are looking at a container on the server and need more info, swift-container-info will display information such as the account, container, replica locations and metadata of the container.

If you are looking at an account on the server and need more info, swift-account-info will display the account, replica locations and metadata of the account.

If you want to audit the data for an account, swift-account-audit can be used to crawl the account, checking that all containers and objects can be found.

Managing Services

Swift services are generally managed with swift-init. The general usage is swift-init <service> <command>, where service is the Swift service to manage (for example object, container, account, proxy) and command is one of:

Command	Description
start	Start the service
stop	Stop the service
restart	Restart the service
shutdown	Attempt to gracefully shut down the service
reload	Attempt to gracefully restart the service

A graceful shutdown or reload will finish any current requests before completely stopping the old service. There is also a special case of swift-init all <command>, which will run the command for all swift services.

In cases where there are multiple configs for a service, a specific config can be managed with swift-init <service>.<config> <command>. For example, when a separate replication network is used, there might be /etc/swift/object-server/public.conf for the object server and /etc/swift/object-server/replication.conf for the replication services. In this case, the replication services could be restarted with swift-init object-server.replication restart.
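If you script service management, the swift-init argument forms described above can be assembled programmatically. This is a sketch only: swift_init_args is a hypothetical helper, and it merely builds the argument list (which could then be handed to something like subprocess.run) without invoking swift-init itself.

```python
# Commands listed in the table above.
VALID_COMMANDS = {"start", "stop", "restart", "shutdown", "reload"}


def swift_init_args(service, command, config=None):
    """Build the argument list for 'swift-init <service>[.<config>] <command>'."""
    if command not in VALID_COMMANDS:
        raise ValueError("unknown swift-init command: %r" % command)
    # A specific config is addressed as <service>.<config>.
    target = "%s.%s" % (service, config) if config else service
    return ["swift-init", target, command]


print(swift_init_args("object-server", "restart", config="replication"))
print(swift_init_args("all", "reload"))
```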

Object Auditor

On system failures, the XFS file system can sometimes truncate files it's trying to write and produce zero-byte files. The object-auditor will catch these problems, but after a system crash it is advisable to run an extra, less rate-limited sweep to check for these specific files. You can run this command as follows:

swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

"-z" means to only check for zero-byte files, here at 1000 files per second.

At times it is useful to be able to run the object auditor on a specific device or set of devices. You can run the object-auditor as follows:

swift-object-auditor /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object auditor on only the sda and sdb devices. This parameter accepts a comma-separated list of values.

Object Replicator

At times it is useful to be able to run the object replicator on a specific device or partition. You can run the object-replicator as follows:

swift-object-replicator /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object replicator on only the sda and sdb devices. You can likewise run that command with --partitions. Both parameters accept a comma-separated list of values. If both are specified, they will be ANDed together. These options can only be used in "once" mode.
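To make the "ANDed together" behaviour concrete, the sketch below shows how a combined device and partition filter selects work. It illustrates the semantics described above and is not Swift's replicator code; job_selected and the sample values are hypothetical.

```python
def job_selected(device, partition, devices=None, partitions=None):
    """Return True if a (device, partition) pair passes both filters.

    A filter that is None (option not given) matches everything.
    """
    device_ok = devices is None or device in devices
    partition_ok = partitions is None or partition in partitions
    return device_ok and partition_ok


# Comma-separated option values, split the way a CLI option might be.
devices = "sda,sdb".split(",")
partitions = "1234,5678".split(",")

print(job_selected("sda", "1234", devices, partitions))  # both filters match
print(job_selected("sda", "9999", devices, partitions))  # partition filter fails
print(job_selected("sdc", "1234", devices, partitions))  # device filter fails
```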

Swift Orphans

Swift Orphans are processes left over after a reload of a Swift server.

For example, when upgrading a proxy server you would probably finish with a swift-init proxy-server reload or /etc/init.d/swift-proxy reload. This kills the parent proxy server process and leaves the child processes running to finish processing whatever requests they might be handling at the time. It then starts up a new parent proxy server process and its children to handle new incoming requests. This allows zero-downtime upgrades with no impact to existing requests.

The orphaned child processes may take a while to exit, depending on the length of the requests they were handling. However, sometimes an old process can be hung up due to some bug or hardware issue. In these cases, these orphaned processes will hang around forever. swift-orphans can be used to find and kill these orphans.

swift-orphans with no arguments will just list the orphans it finds that were started more than 24 hours ago. You shouldn't really check for orphans until 24 hours after you perform a reload, as some requests can take a long time to process. swift-orphans -k TERM will send the SIGTERM signal to the orphan processes, or you can kill -TERM the pids yourself if you prefer.

You can run swift-orphans --help for more options.

Swift Oldies

Swift Oldies are processes that have just been around for a long time. There's nothing necessarily wrong with this, but it might indicate a hung process if you regularly upgrade and reload/restart services. You might have so many servers that you don't notice when a reload/restart fails; swift-oldies can help with this.

For example, if you upgraded and reloaded/restarted everything 2 days ago, and you've already cleaned up any orphans with swift-orphans, you can run swift-oldies -a 48 to find any Swift processes still around that were started more than 2 days ago and then investigate them accordingly.
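If you record when each reload or restart happened, the age to pass to swift-oldies -a can be derived from that timestamp. The helper below is a hypothetical sketch; it assumes the age is given in whole hours, as in the example above, and rounds up so that processes started by the reload itself are not flagged.

```python
import math
from datetime import datetime


def oldies_age_hours(reloaded_at, now):
    """Whole hours elapsed since the reload, rounded up."""
    seconds = (now - reloaded_at).total_seconds()
    return math.ceil(seconds / 3600)


# Example timestamps, made up for illustration.
reloaded_at = datetime(2014, 6, 17, 10, 0)
now = datetime(2014, 6, 19, 10, 0)
age = oldies_age_hours(reloaded_at, now)
print("swift-oldies -a %d" % age)
```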

Custom Log Handlers

Swift supports setting up custom log handlers for services by specifying a comma-separated list of functions to invoke when logging is set up. It does so via the log_custom_handlers configuration option. Logger hooks invoked are passed the same arguments as Swift's get_logger function (as well as the getLogger and LogAdapter objects):

Name	Description
conf	Configuration dict to read settings from
name	Name of the logger received
log_to_console	(optional) Write log messages to console on stderr
log_route	Route for the logging received
fmt	Override log format received
logger	The logging.getLogger object
adapted_logger	The LogAdapter object

A basic example that sets up a custom logger might look like the following:

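def my_logger(conf, name, log_to_console, log_route, fmt, logger,
              adapted_logger):
    # Read a custom setting from the service's configuration dict.
    my_conf_opt = conf.get('some_custom_setting')
    # third_party_logstore_handler is a placeholder for your own handler class.
    my_handler = third_party_logstore_handler(my_conf_opt)
    # Attach the handler to the underlying logging.getLogger object.
    logger.addHandler(my_handler)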

See custom-logger-hooks-label for sample use cases.
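For a variant that runs without any third-party handler, the sketch below attaches a standard-library logging.handlers.MemoryHandler instead. The hook name and the 'custom_buffer_capacity' option are hypothetical, and the hook is called by hand at the end only to stand in for what Swift would do when logging is set up.

```python
import logging
import logging.handlers


def buffered_handler_hook(conf, name, log_to_console, log_route, fmt, logger,
                          adapted_logger):
    """Attach an in-memory buffering handler to the logger passed in."""
    # 'custom_buffer_capacity' is a hypothetical option name for this sketch.
    capacity = int(conf.get('custom_buffer_capacity', 100))
    handler = logging.handlers.MemoryHandler(capacity)
    if fmt:
        # Honour the log format override when one is supplied.
        handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)


# Call the hook by hand, standing in for the call Swift would make.
demo_logger = logging.getLogger('custom-hook-demo')
buffered_handler_hook({'custom_buffer_capacity': '10'}, 'demo', False,
                      'demo', None, demo_logger, None)
print(demo_logger.handlers)
```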
