diff --git a/doc/source/admin_guide.rst b/doc/source/admin_guide.rst index aba899f5fa..a39d164534 100644 --- a/doc/source/admin_guide.rst +++ b/doc/source/admin_guide.rst @@ -108,8 +108,100 @@ different distro or OS, some care should be taken before using in production. Cluster Health -------------- -TODO: Greg, add docs here about how to use swift-stats-populate, and -swift-stats-report +There is a swift-stats-report tool for measuring overall cluster health. This +is accomplished by checking if a set of deliberately distributed containers and +objects are currently in their proper places within the cluster. + +For instance, a common deployment has three replicas of each object. The health +of that object can be measured by checking if each replica is in its proper +place. If only 2 of the 3 is in place the object's heath can be said to be at +66.66%, where 100% would be perfect. + +A single object's health, especially an older object, usually reflects the +health of that entire partition the object is in. If we make enough objects on +a distinct percentage of the partitions in the cluster, we can get a pretty +valid estimate of the overall cluster health. In practice, about 1% partition +coverage seems to balance well between accuracy and the amount of time it takes +to gather results. + +The first thing that needs to be done to provide this health value is create a +new account solely for this usage. Next, we need to place the containers and +objects throughout the system so that they are on distinct partitions. The +swift-stats-populate tool does this by making up random container and object +names until they fall on distinct partitions. Last, and repeatedly for the life +of the cluster, we need to run the swift-stats-report tool to check the health +of each of these containers and objects. + +These tools need direct access to the entire cluster and to the ring files +(installing them on an auth server or a proxy server will probably do). Both +swift-stats-populate and swift-stats-report use the same configuration file, +/etc/swift/stats.conf. Example conf file:: + + [stats] + auth_url = http://saio:11000/v1.0 + auth_user = test:tester + auth_key = testing + +There are also options for the conf file for specifying the dispersion coverage +(defaults to 1%), retries, concurrency, CSV output file, etc. though usually +the defaults are fine. + +Once the configuration is in place, run `swift-stats-populate -d` to populate +the containers and objects throughout the cluster. + +Now that those containers and objects are in place, you can run +`swift-stats-report -d` to get a dispersion report, or the overall health of +the cluster. Here is an example of a cluster in perfect health:: + + $ swift-stats-report -d + Queried 2621 containers for dispersion reporting, 19s, 0 retries + 100.00% of container copies found (7863 of 7863) + Sample represents 1.00% of the container partition space + + Queried 2619 objects for dispersion reporting, 7s, 0 retries + 100.00% of object copies found (7857 of 7857) + Sample represents 1.00% of the object partition space + +Now I'll deliberately double the weight of a device in the object ring (with +replication turned off) and rerun the dispersion report to show what impact +that has:: + + $ swift-ring-builder object.builder set_weight d0 200 + $ swift-ring-builder object.builder rebalance + ... + $ swift-stats-report -d + Queried 2621 containers for dispersion reporting, 8s, 0 retries + 100.00% of container copies found (7863 of 7863) + Sample represents 1.00% of the container partition space + + Queried 2619 objects for dispersion reporting, 7s, 0 retries + There were 1763 partitions missing one copy. + 77.56% of object copies found (6094 of 7857) + Sample represents 1.00% of the object partition space + +You can see the health of the objects in the cluster has gone down +significantly. Of course, I only have four devices in this test environment, in +a production environment with many many devices the impact of one device change +is much less. Next, I'll run the replicators to get everything put back into +place and then rerun the dispersion report:: + + ... start object replicators and monitor logs until they're caught up ... + $ swift-stats-report -d + Queried 2621 containers for dispersion reporting, 17s, 0 retries + 100.00% of container copies found (7863 of 7863) + Sample represents 1.00% of the container partition space + + Queried 2619 objects for dispersion reporting, 7s, 0 retries + 100.00% of object copies found (7857 of 7857) + Sample represents 1.00% of the object partition space + +So that's a summation of how to use swift-stats-report to monitor the health of +a cluster. There are a few other things it can do, such as performance +monitoring, but those are currently in their infancy and little used. For +instance, you can run `swift-stats-populate -p` and `swift-stats-report -p` to +get performance timings (warning: the initial populate takes a while). These +timings are dumped into a CSV file (/etc/swift/stats.csv by default) and can +then be graphed to see how cluster performance is trending. ------------------------ Debugging Tips and Tools