Add doc entry to check partition count

A high or increasing partition count due to stored handoffs can have
severe side effects, and replication might never be able to catch up.
This patch adds a note to the admin_guide on how to check this.

Change-Id: Ib4e161d68f1a82236dbf5fac13ef9a13ac4bbf18
Christian Schwede 2016-06-09 06:17:22 +00:00
parent 11c5ef7d22
commit 699953508a


@@ -617,13 +617,90 @@ have 6 replicas in region 1.
You should be aware that, if you have data coming into SF faster than
your replicators are transferring it to NY, then your cluster's data distribution
will get worse and worse over time as objects pile up in SF. If this
happens, it is recommended to disable write_affinity and simply let
object PUTs traverse the WAN link, as that will naturally limit the
object growth rate to what your WAN link can handle.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Checking handoff partition distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can check if handoff partitions are piling up on a server by
comparing the expected number of partitions with the actual number on
your disks. First get the number of partitions that are currently
assigned to a server using the ``dispersion`` command from
``swift-ring-builder``::

    swift-ring-builder sample.builder dispersion --verbose
    Dispersion is 0.000000, Balance is 0.000000, Overload is 0.00%
    Required overload is 0.000000%
    --------------------------------------------------------------------------
    Tier                    Parts      %   Max      0      1      2      3
    --------------------------------------------------------------------------
    r1                       8192   0.00     2      0      0   8192      0
    r1z1                     4096   0.00     1   4096   4096      0      0
    r1z1-172.16.10.1         4096   0.00     1   4096   4096      0      0
    r1z1-172.16.10.1/sda1    4096   0.00     1   4096   4096      0      0
    r1z2                     4096   0.00     1   4096   4096      0      0
    r1z2-172.16.10.2         4096   0.00     1   4096   4096      0      0
    r1z2-172.16.10.2/sda1    4096   0.00     1   4096   4096      0      0
    r1z3                     4096   0.00     1   4096   4096      0      0
    r1z3-172.16.10.3         4096   0.00     1   4096   4096      0      0
    r1z3-172.16.10.3/sda1    4096   0.00     1   4096   4096      0      0
    r1z4                     4096   0.00     1   4096   4096      0      0
    r1z4-172.16.20.4         4096   0.00     1   4096   4096      0      0
    r1z4-172.16.20.4/sda1    4096   0.00     1   4096   4096      0      0
    r2                       8192   0.00     2      0   8192      0      0
    r2z1                     4096   0.00     1   4096   4096      0      0
    r2z1-172.16.20.1         4096   0.00     1   4096   4096      0      0
    r2z1-172.16.20.1/sda1    4096   0.00     1   4096   4096      0      0
    r2z2                     4096   0.00     1   4096   4096      0      0
    r2z2-172.16.20.2         4096   0.00     1   4096   4096      0      0
    r2z2-172.16.20.2/sda1    4096   0.00     1   4096   4096      0      0

As you can see from the output, each server should store 4096 partitions, and
each region should store 8192 partitions. This example used a partition power
of 13 and 3 replicas.
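
These expected numbers follow directly from the ring parameters. As a quick
cross-check (not part of the Swift tooling), here is a minimal sketch in
Python, assuming the example ring above with partition power 13, 3 replicas
and 6 devices of equal weight::

    # Rough cross-check of the dispersion output above; the values are the
    # example ring parameters, not defaults.
    part_power = 13
    replicas = 3
    devices = 6        # 4 in region r1, 2 in region r2

    partitions = 2 ** part_power              # 8192 partitions in the ring
    replica_parts = partitions * replicas     # 24576 partition replicas
    per_device = replica_parts // devices     # 4096 partitions per device

    print(partitions, replica_parts, per_device)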

With write_affinity enabled, expect to see more partitions on disk than the
value reported by the swift-ring-builder dispersion command. The number of
additional (handoff) partitions in region r1 depends on your cluster size,
the amount of incoming data and the replication speed.

Let's use the example from above with 6 nodes in 2 regions, and write_affinity
configured to write to region r1 first. `swift-ring-builder` reported that
each node should store 4096 partitions::

    Expected partitions for region r2: 8192
    Handoffs stored across 4 nodes in region r1: 8192 / 4 = 2048
    Maximum number of partitions on each server in region r1: 2048 + 4096 = 6144

The worst case is that handoff partitions in region 1 are populated with new
object replicas faster than replication is able to move them to region 2.
In that case you will see close to 6144 partitions per server in region r1.
Your actual number should be somewhere between 4096 and 6144 partitions
(preferably on the lower side).
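
The same worst-case estimate can be written down as a short calculation; a
minimal sketch in Python, assuming the example layout above (4 nodes in
region r1 holding the handoffs for the 2 nodes in region r2)::

    # Worst-case handoff estimate for the example above; all numbers are
    # illustrative and come from the dispersion output, not from Swift itself.
    assigned_per_node = 4096    # partitions assigned by the ring per node
    r2_partitions = 8192        # partition replicas that belong in region r2
    r1_nodes = 4                # nodes in region r1 taking the handoffs

    handoffs_per_node = r2_partitions // r1_nodes                 # 2048
    worst_case_per_node = assigned_per_node + handoffs_per_node   # 6144

    print(handoffs_per_node, worst_case_per_node)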

Now count the number of object partitions on a given server in region 1,
for example on 172.16.10.1. Note that the pathnames might be
different; `/srv/node/` is the default mount location, and `objects`
applies only to storage policy 0 (storage policy 1 would use
`objects-1` and so on)::

    find -L /srv/node/ -maxdepth 3 -type d -wholename "*objects/*" | wc -l

If this number is always at the upper end of the expected partition
range (4096 to 6144) or keeps increasing, you should check your
replication speed and maybe even disable write_affinity.
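
To automate this comparison, a small script can count the partition
directories and compare the result with the expected range. This is only a
sketch in Python, assuming the default `/srv/node/` mount location, storage
policy 0 and the example thresholds from above::

    import os

    EXPECTED = 4096      # partitions assigned by the ring (example value)
    WORST_CASE = 6144    # assigned partitions plus worst-case handoffs

    def count_partitions(srv_node="/srv/node", objects_dir="objects"):
        """Count object partition directories for one policy on this server."""
        total = 0
        for device in os.listdir(srv_node):
            path = os.path.join(srv_node, device, objects_dir)
            if not os.path.isdir(path):
                continue
            total += sum(1 for entry in os.listdir(path)
                         if os.path.isdir(os.path.join(path, entry)))
        return total

    found = count_partitions()
    print("%d partitions on disk (expected %d, worst case %d)"
          % (found, EXPECTED, WORST_CASE))
    if found > EXPECTED:
        print("handoff partitions present - check replication progress")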

Please refer to the next section on how to collect metrics from Swift, and
especially to :ref:`swift-recon -r <recon-replication>` on how to check
replication stats.

--------------------------------
Cluster Telemetry and Monitoring
--------------------------------
@@ -748,6 +825,8 @@ This information can also be queried via the swift-recon command line utility::
Time to wait for a response from a server
--swiftdir=SWIFTDIR Default = /etc/swift

.. _recon-replication:

For example, to obtain container replication info from all hosts in zone "3"::

    fhines@ubuntu:~$ swift-recon container -r --zone 3