Update admin guide on handling drive failures

Simply replacing a failed disk requires a very long time if the ring is not
changed, because all data will be replicated to a single new disk. This extends
the time to recover from missing replicas, and becomes even more important with
bigger disks.

This patch updates the doc to include a faster alternative by setting the weight
of a failed disk to 0.  In this case the partitions from the failed disk are
distributed and replicated to the remaining disks in the cluster, and because
each disk gets only a fraction of the partitions it's also much faster.
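
As a rough illustration of the difference, the sketch below estimates both
recovery times. It is not taken from Swift: the function name, the 4 TB disk
size, the 100 MB/s write rate and the 99 remaining disks are all assumptions
chosen for the example, and it assumes that disk writes are the only bottleneck
and that partitions spread evenly.

```python
def recovery_hours(data_tb, write_mb_per_s, receiving_disks):
    """Estimate hours needed to restore the missing replicas.

    Hypothetical model: every receiving disk writes in parallel at the
    same rate, and reads and network transfer are never the limit.
    """
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    per_disk_mb = data_mb / receiving_disks  # even share per receiving disk
    return per_disk_mb / write_mb_per_s / 3600  # seconds to hours


# Assumed figures: 4 TB of replicas to restore, 100 MB/s sustained writes.
single = recovery_hours(4, 100, 1)   # everything goes to one new disk
spread = recovery_hours(4, 100, 99)  # weight 0: spread over 99 other disks

print(f"single new disk: {single:.1f} h")
print(f"spread over 99 disks: {spread * 60:.1f} min")
```

In a real cluster the gain is smaller than this model suggests, because
replication concurrency, network bandwidth and regular client traffic also
limit throughput.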

Change-Id: I16617756359771ad89ca5d4690b58a014f481d9b
Christian Schwede 2014-10-29 10:34:53 +00:00
parent dd3564a587
commit 83030b921d

@@ -145,9 +145,20 @@ then it is just best to replace the drive, format it, remount it, and let
 replication fill it up.
 
 If the drive can't be replaced immediately, then it is best to leave it
-unmounted, and remove the drive from the ring. This will allow all the
+unmounted, and set the device weight to 0. This will allow all the
 replicas that were on that drive to be replicated elsewhere until the drive
-is replaced. Once the drive is replaced, it can be re-added to the ring.
+is replaced. Once the drive is replaced, the device weight can be increased
+again. Setting the device weight to 0 instead of removing the drive from the
+ring gives Swift the chance to replicate data from the failing disk too (in case
+it is still possible to read some of the data).
+
+Setting the device weight to 0 (or removing a failed drive from the ring) has
+another benefit: all partitions that were stored on the failed drive are
+distributed over the remaining disks in the cluster, and each disk only needs to
+store a few new partitions. This is much faster compared to replicating all
+partitions to a single, new disk. It decreases the time to recover from a
+degraded number of replicas significantly, and becomes more and more important
+with bigger disks.
 
 -----------------------
 Handling Server Failure