Cluster health monitoring docs

2010-08-17 19:52:50 +00:00 · 2010-08-17 19:52:50 +00:00 · cac53bc84a
commit cac53bc84a
parent db522deba9 faa96c6aed
1 changed files with 94 additions and 2 deletions
--- a/doc/source/admin_guide.rst
+++ b/doc/source/admin_guide.rst
@@ -108,8 +108,100 @@ different distro or OS, some care should be taken before using in production.
 Cluster Health
 --------------
-TODO: Greg, add docs here about how to use swift-stats-populate, and
-swift-stats-report
+There is a swift-stats-report tool for measuring overall cluster health. This
+is accomplished by checking if a set of deliberately distributed containers and
 objects are currently in their proper places within the cluster.
 For instance, a common deployment has three replicas of each object. The health
 of that object can be measured by checking if each replica is in its proper
 place. If only 2 of the 3 are in place, the object's health can be said to be at
 66.66%, where 100% would be perfect.
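 The per-object figure above is simple arithmetic. A minimal sketch, assuming a
 hypothetical helper named replica_health (it is not part of the Swift tools):

```python
def replica_health(found: int, expected: int) -> float:
    """Return the percentage of expected replicas that are in place.

    Hypothetical helper for illustration only; not part of Swift.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if not 0 <= found <= expected:
        raise ValueError("found must be between 0 and expected")
    return 100.0 * found / expected


if __name__ == "__main__":
    # Two of three replicas in place.
    print(f"{replica_health(2, 3):.2f}%")
```

 Rounded to two places this gives 66.67%; the 66.66% quoted above is the same
 value truncated.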
 A single object's health, especially an older object's, usually reflects the
 health of the entire partition that object is in. If we create enough objects
 on a distinct percentage of the partitions in the cluster, we can get a
 reasonably valid estimate of the overall cluster health. In practice, about 1%
 partition coverage seems to balance accuracy well against the time it takes to
 gather results.
 To provide this health value, first create a new account solely for this
 purpose. Next, place the containers and objects throughout the system so that
 they are on distinct partitions. The swift-stats-populate tool does this by
 making up random container and object names until they fall on distinct
 partitions. Last, and repeatedly for the life of the cluster, run the
 swift-stats-report tool to check the health of each of these containers and
 objects.
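 The idea of making up random names until they fall on distinct partitions can
 be sketched as follows. This is an illustration under stated assumptions, not
 the code swift-stats-populate actually runs: the partition mapping here is a
 simplified "top bits of an MD5 hash" scheme, and part_power, coverage_pct, and
 the name prefix are hypothetical parameters chosen for the example.

```python
import hashlib
import struct
import uuid


def partition_for(name: str, part_power: int) -> int:
    """Map a name to a partition using the top part_power bits of its MD5 hash.

    Simplified stand-in for a real ring lookup (assumption for illustration).
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return struct.unpack_from(">I", digest)[0] >> (32 - part_power)


def names_on_distinct_partitions(part_power: int, coverage_pct: float,
                                 prefix: str = "stats_object_") -> dict:
    """Generate random names until the wanted number of distinct partitions is hit.

    Returns a dict mapping partition number -> the name that landed on it.
    """
    partition_count = 2 ** part_power
    wanted = max(1, int(partition_count * coverage_pct / 100))
    chosen = {}
    while len(chosen) < wanted:
        name = prefix + uuid.uuid4().hex
        part = partition_for(name, part_power)
        # Keep only the first name seen for each partition.
        chosen.setdefault(part, name)
    return chosen


if __name__ == "__main__":
    sample = names_on_distinct_partitions(part_power=8, coverage_pct=10)
    print(len(sample), "names on distinct partitions")
```

 Names that collide with an already covered partition are simply discarded,
 so the loop gets slower as coverage rises; at around 1% coverage collisions
 are rare.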
 These tools need direct access to the entire cluster and to the ring files
 (installing them on an auth server or a proxy server will probably do). Both
 swift-stats-populate and swift-stats-report use the same configuration file,
 /etc/swift/stats.conf. Example conf file::
    [stats]
    auth_url = http://saio:11000/v1.0
    auth_user = test:tester
    auth_key = testing
 The conf file also accepts options for specifying the dispersion coverage
 (defaults to 1%), retries, concurrency, CSV output file, and so on, though
 the defaults are usually fine.
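 To sanity-check a conf file before running the tools, the three settings shown
 in the example can be read with Python's standard configparser module. This is
 a sketch only: load_stats_conf is a hypothetical helper, and it is not claimed
 to be how the Swift tools themselves parse the file.

```python
import configparser

# The example conf file from the text above.
SAMPLE_CONF = """\
[stats]
auth_url = http://saio:11000/v1.0
auth_user = test:tester
auth_key = testing
"""


def load_stats_conf(text: str) -> dict:
    """Parse conf text and return the [stats] section as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["stats"])


if __name__ == "__main__":
    conf = load_stats_conf(SAMPLE_CONF)
    for key in ("auth_url", "auth_user", "auth_key"):
        print(key, "=", conf[key])
```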
 Once the configuration is in place, run `swift-stats-populate -d` to populate
 the containers and objects throughout the cluster.
 Now that those containers and objects are in place, you can run
 `swift-stats-report -d` to get a dispersion report, or the overall health of
 the cluster. Here is an example of a cluster in perfect health::
    $ swift-stats-report -d
    Queried 2621 containers for dispersion reporting, 19s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space
    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space
 Now I'll deliberately double the weight of a device in the object ring (with
 replication turned off) and rerun the dispersion report to show what impact
 that has::
    $ swift-ring-builder object.builder set_weight d0 200
    $ swift-ring-builder object.builder rebalance
    ...
    $ swift-stats-report -d
    Queried 2621 containers for dispersion reporting, 8s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space
    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    There were 1763 partitions missing one copy.
    77.56% of object copies found (6094 of 7857)
    Sample represents 1.00% of the object partition space
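 The object figures in the report above can be cross-checked by hand: 2619
 objects at three replicas each gives 7857 expected copies, and 1763 partitions
 missing one copy leaves 6094. A small sketch of that arithmetic, using a
 hypothetical helper name and assuming three replicas as described earlier:

```python
def copies_found(queried: int, replicas: int, missing_by_count: dict) -> tuple:
    """Return (found, expected, percent) for a dispersion sample.

    missing_by_count maps "copies missing" -> "number of partitions", for
    example {1: 1763} for 1763 partitions missing one copy each.
    Hypothetical helper for illustration; not part of swift-stats-report.
    """
    expected = queried * replicas
    missing = sum(copies * parts for copies, parts in missing_by_count.items())
    found = expected - missing
    return found, expected, 100.0 * found / expected


if __name__ == "__main__":
    found, expected, pct = copies_found(2619, 3, {1: 1763})
    print(f"{pct:.2f}% of object copies found ({found} of {expected})")
```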
 You can see the health of the objects in the cluster has gone down
 significantly. Of course, I only have four devices in this test environment;
 in a production environment with many devices, the impact of a single device
 change is much smaller. Next, I'll run the replicators to get everything put
 back into place and then rerun the dispersion report::
    ... start object replicators and monitor logs until they're caught up ...
    $ swift-stats-report -d
    Queried 2621 containers for dispersion reporting, 17s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space
    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space
 That summarizes how to use swift-stats-report to monitor the health of a
 cluster. The tools can do a few other things, such as performance monitoring,
 but those features are currently in their infancy and little used. For
 instance, you can run `swift-stats-populate -p` and `swift-stats-report -p` to
 get performance timings (warning: the initial populate takes a while). These
 timings are dumped into a CSV file (/etc/swift/stats.csv by default) and can
 then be graphed to see how cluster performance is trending.
 ------------------------
 Debugging Tips and Tools