From 699953508ad1fd82c221e57bccfb1de8bf7a7e31 Mon Sep 17 00:00:00 2001
From: Christian Schwede
Date: Thu, 9 Jun 2016 06:17:22 +0000
Subject: [PATCH] Add doc entry to check partition count

A high or increasing partition count due to stored handoffs can have
severe side effects, and replication might never be able to catch up.
This patch adds a note to the admin guide on how to check this.

Change-Id: Ib4e161d68f1a82236dbf5fac13ef9a13ac4bbf18
---
 doc/source/admin_guide.rst | 81 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/doc/source/admin_guide.rst b/doc/source/admin_guide.rst
index 4a5c2db3a4..91ee2d00c3 100644
--- a/doc/source/admin_guide.rst
+++ b/doc/source/admin_guide.rst
@@ -617,13 +617,90 @@ have 6 replicas in region 1.
 
 You should be aware that, if you have data coming into SF faster than
-your link to NY can transfer it, then your cluster's data distribution
+your replicators are transferring it to NY, then your cluster's data distribution
 will get worse and worse over time as objects pile up in SF. If this
 happens, it is recommended to disable write_affinity and simply let
 object PUTs traverse the WAN link, as that will naturally limit the
 object growth rate to what your WAN link can handle.
 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Checking handoff partition distribution
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can check if handoff partitions are piling up on a server by
+comparing the expected number of partitions with the actual number on
+your disks. First, get the number of partitions that are currently
+assigned to a server using the ``dispersion`` command from
+``swift-ring-builder``::
+
+    swift-ring-builder sample.builder dispersion --verbose
+    Dispersion is 0.000000, Balance is 0.000000, Overload is 0.00%
+    Required overload is 0.000000%
+    --------------------------------------------------------------------------
+    Tier                   Parts      %   Max    0    1    2    3
+    --------------------------------------------------------------------------
+    r1                      8192   0.00     2    0    0 8192    0
+    r1z1                    4096   0.00     1 4096 4096    0    0
+    r1z1-172.16.10.1        4096   0.00     1 4096 4096    0    0
+    r1z1-172.16.10.1/sda1   4096   0.00     1 4096 4096    0    0
+    r1z2                    4096   0.00     1 4096 4096    0    0
+    r1z2-172.16.10.2        4096   0.00     1 4096 4096    0    0
+    r1z2-172.16.10.2/sda1   4096   0.00     1 4096 4096    0    0
+    r1z3                    4096   0.00     1 4096 4096    0    0
+    r1z3-172.16.10.3        4096   0.00     1 4096 4096    0    0
+    r1z3-172.16.10.3/sda1   4096   0.00     1 4096 4096    0    0
+    r1z4                    4096   0.00     1 4096 4096    0    0
+    r1z4-172.16.20.4        4096   0.00     1 4096 4096    0    0
+    r1z4-172.16.20.4/sda1   4096   0.00     1 4096 4096    0    0
+    r2                      8192   0.00     2    0 8192    0    0
+    r2z1                    4096   0.00     1 4096 4096    0    0
+    r2z1-172.16.20.1        4096   0.00     1 4096 4096    0    0
+    r2z1-172.16.20.1/sda1   4096   0.00     1 4096 4096    0    0
+    r2z2                    4096   0.00     1 4096 4096    0    0
+    r2z2-172.16.20.2        4096   0.00     1 4096 4096    0    0
+    r2z2-172.16.20.2/sda1   4096   0.00     1 4096 4096    0    0
+
+As you can see from the output, each server should store 4096 partitions,
+and each region should store 8192 partitions. This example used a
+partition power of 13 and 3 replicas.
+
+With write_affinity enabled, you should expect a higher number of
+partitions on disk than the value reported by the
+``swift-ring-builder dispersion`` command. The number of additional
+(handoff) partitions in region r1 depends on your cluster size, the
+amount of incoming data, and the replication speed.
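+
+How many extra partitions to expect can be estimated from the ring
+parameters. The next paragraph walks through the numbers for the example
+ring above; the shell sketch below shows the same estimate in a form you
+can adapt. The values are assumptions taken from that example (partition
+power 13, 3 replicas, equal device weights, 4 nodes in region r1 and 2
+nodes in region r2), so adjust them to match your own ring::
+
+    # Assumed example values; replace them with your ring's parameters.
+    PART_POWER=13
+    REPLICAS=3
+    NODES_R1=4
+    NODES_R2=2
+
+    PARTS=$((1 << PART_POWER))                # 2^13 = 8192 partitions
+    NODES=$((NODES_R1 + NODES_R2))
+    ASSIGNED=$((PARTS * REPLICAS / NODES))    # 4096 partitions per node
+    R2_REPLICAS=$((NODES_R2 * ASSIGNED))      # 8192 replicas belong in region r2
+    HANDOFFS=$((R2_REPLICAS / NODES_R1))      # up to 2048 handoffs per r1 node
+    echo "Expected partitions per node in r1: $ASSIGNED to $((ASSIGNED + HANDOFFS))"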
+
+Let's use the example from above with 6 nodes in 2 regions, and
+write_affinity configured to write to region r1 first.
+``swift-ring-builder`` reported that each node should store 4096
+partitions::
+
+    Expected partitions for region r2: 8192
+    Handoffs stored across 4 nodes in region r1: 8192 / 4 = 2048
+    Maximum number of partitions on each server in region r1: 2048 + 4096 = 6144
+
+The worst case is that handoff partitions in region r1 are populated
+with new object replicas faster than replication is able to move them to
+region r2. In that case you will see approximately 6144 partitions per
+server in region r1. Your actual number should be lower, somewhere
+between 4096 and 6144 partitions (preferably on the lower side).
+
+Now count the number of object partitions on a given server in region
+r1, for example on 172.16.10.1. Note that the pathnames might be
+different; ``/srv/node/`` is the default mount location, and ``objects``
+applies only to storage policy 0 (storage policy 1 would use
+``objects-1`` and so on)::
+
+    find -L /srv/node/ -maxdepth 3 -type d -wholename "*objects/*" | wc -l
+
+If this number is consistently at the upper end of the expected
+partition range (4096 to 6144), or is increasing, you should check your
+replication speed and maybe even disable write_affinity.
+Please refer to the next section on how to collect metrics from Swift,
+and especially to :ref:`swift-recon -r <recon-replication>` on how to
+check replication stats.
+
+
 --------------------------------
 Cluster Telemetry and Monitoring
 --------------------------------
@@ -748,6 +825,8 @@ This information can also be queried via the swift-recon command line utility::
                         Time to wait for a response from a server
   --swiftdir=SWIFTDIR   Default = /etc/swift
 
+.. _recon-replication:
+
 For example, to obtain container replication info from all hosts in zone "3"::
 
     fhines@ubuntu:~$ swift-recon container -r --zone 3
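+
+The object replication stats referenced in the handoff partition section
+above can be checked the same way; for example, to query all object
+servers in the cluster (shown here only as a sketch, without its
+output, and run from a host that has the rings in ``/etc/swift``)::
+
+    fhines@ubuntu:~$ swift-recon object -r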