Update admin guide on handling drive failures

Simply replacing a failed disk without changing the ring makes recovery very slow, because all of its data has to be replicated to the single new disk. This extends the time the cluster runs with missing replicas, and it matters more as disks get bigger.

This patch updates the doc to describe a faster alternative: setting the weight of the failed disk to 0. The partitions from the failed disk are then distributed and replicated across the remaining disks in the cluster, and because each disk receives only a fraction of the partitions, recovery is much faster.

Change-Id: I16617756359771ad89ca5d4690b58a014f481d9b
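The speed-up described in the commit message can be illustrated with a rough back-of-the-envelope model. The sketch below is not part of the patch. It assumes recovery is limited purely by the write throughput of the receiving disks and that partitions are spread evenly; the function name `recovery_hours` and all numbers (4 TB of data, 50 MB/s per disk, 99 remaining disks) are hypothetical. Real clusters are also limited by read load, network bandwidth, and replicator concurrency, so treat the result as an order-of-magnitude comparison only.

```python
def recovery_hours(data_tb, write_mb_per_s, receiving_disks):
    """Estimate hours needed to re-replicate `data_tb` terabytes.

    Simplified model (an assumption, not Swift's actual behavior):
    the data is split evenly across `receiving_disks`, all disks
    write in parallel, and each sustains `write_mb_per_s`.
    """
    total_mb = data_tb * 1_000_000          # 1 TB = 10^6 MB (decimal units)
    per_disk_mb = total_mb / receiving_disks
    seconds = per_disk_mb / write_mb_per_s
    return seconds / 3600


# Hypothetical scenario: 4 TB lived on the failed disk.
single_disk = recovery_hours(4, 50, receiving_disks=1)    # replace the drive only
spread_out = recovery_hours(4, 50, receiving_disks=99)    # weight 0, 99 disks share the load

print(f"single new disk: {single_disk:.1f} h")
print(f"spread over 99 disks: {spread_out * 60:.1f} min")
```

Under these assumptions the speed-up equals the number of receiving disks, which is why the effect grows with cluster size and with the amount of data stored per disk.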
2014-10-29 10:34:53 +00:00
commit 83030b921d
parent dd3564a587
1 changed file with 13 additions and 2 deletions
--- a/doc/source/admin_guide.rst
+++ b/doc/source/admin_guide.rst
@@ -145,9 +145,20 @@ then it is just best to replace the drive, format it, remount it, and let
 replication fill it up.

 If the drive can't be replaced immediately, then it is best to leave it
-unmounted, and remove the drive from the ring. This will allow all the
+unmounted, and set the device weight to 0. This will allow all the
 replicas that were on that drive to be replicated elsewhere until the drive
-is replaced.  Once the drive is replaced, it can be re-added to the ring.
+is replaced. Once the drive is replaced, the device weight can be increased
+again. Setting the device weight to 0 instead of removing the drive from the
+ring gives Swift the chance to replicate data from the failing disk too (in case
+it is still possible to read some of the data).
+
+Setting the device weight to 0 (or removing a failed drive from the ring) has
+another benefit: all partitions that were stored on the failed drive are
+distributed over the remaining disks in the cluster, and each disk only needs to
+store a few new partitions. This is much faster compared to replicating all
+partitions to a single, new disk. It decreases the time to recover from a
+degraded number of replicas significantly, and becomes more and more important
+with bigger disks.

 -----------------------
 Handling Server Failure