Update admin guide on handling drive failures

Simply replacing a failed disk takes a very long time if the ring is not changed, because all data will be replicated to a single new disk. This extends the time to recover from missing replicas, and becomes even more important with bigger disks.

This patch updates the doc to include a faster alternative: setting the weight of the failed disk to 0. In this case the partitions from the failed disk are distributed and replicated to the remaining disks in the cluster, and because each disk receives only a fraction of the partitions, recovery is also much faster.

Change-Id: I16617756359771ad89ca5d4690b58a014f481d9b
parent dd3564a587
commit 83030b921d
@@ -145,9 +145,20 @@ then it is just best to replace the drive, format it, remount it, and let
 replication fill it up.
 
 If the drive can't be replaced immediately, then it is best to leave it
-unmounted, and remove the drive from the ring. This will allow all the
+unmounted, and set the device weight to 0. This will allow all the
 replicas that were on that drive to be replicated elsewhere until the drive
-is replaced. Once the drive is replaced, it can be re-added to the ring.
+is replaced. Once the drive is replaced, the device weight can be increased
+again. Setting the device weight to 0 instead of removing the drive from the
+ring gives Swift the chance to replicate data from the failing disk too (in case
+it is still possible to read some of the data).
+
+Setting the device weight to 0 (or removing a failed drive from the ring) has
+another benefit: all partitions that were stored on the failed drive are
+distributed over the remaining disks in the cluster, and each disk only needs to
+store a few new partitions. This is much faster compared to replicating all
+partitions to a single, new disk. It decreases the time to recover from a
+degraded number of replicas significantly, and becomes more and more important
+with bigger disks.
 
 -----------------------
 Handling Server Failure
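The procedure described in the patch (set the device weight to 0, then restore it after the drive is replaced) could be sketched as the following helper. It only assembles `swift-ring-builder` command lines and does not run them. The builder file name `object.builder`, the device search value `d7`, and the restored weight of 100 are hypothetical placeholders, not values from the patch. The sketch also assumes the rebalanced ring file is distributed to the cluster nodes afterwards by whatever mechanism the deployment already uses.

```python
from typing import List


def set_weight_commands(builder_file: str, device: str, weight: float) -> List[List[str]]:
    """Build argv lists that change one device's weight and rebalance the ring.

    builder_file and device are placeholders supplied by the caller;
    nothing is executed here, the commands are only assembled.
    """
    if weight < 0:
        raise ValueError("weight must not be negative")
    return [
        ["swift-ring-builder", builder_file, "set_weight", device, str(weight)],
        ["swift-ring-builder", builder_file, "rebalance"],
    ]


# Hypothetical example: drain failed device d7, later restore its weight.
drain = set_weight_commands("object.builder", "d7", 0)
restore = set_weight_commands("object.builder", "d7", 100)

for argv in drain + restore:
    print(" ".join(argv))
```

Keeping the commands as argv lists makes it straightforward to pass each one to `subprocess.run` once the placeholder values have been replaced with real ones.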
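The claim that spreading partitions over the remaining disks is much faster than refilling a single new disk can be illustrated with a deliberately simple back-of-the-envelope model. All numbers below (4 TB of data, 50 MB/s sustained write rate per disk, 99 remaining disks) are assumptions chosen for illustration. The model also assumes the target disks are written fully in parallel and ignores network limits, source-disk read limits, and replicator concurrency settings, so real recovery times will be longer.

```python
def recovery_hours(data_tb: float, write_mb_per_s: float, target_disks: int) -> float:
    """Rough time to re-replicate data_tb terabytes onto target_disks disks.

    Assumes every target disk is written in parallel at write_mb_per_s
    and that write throughput is the only bottleneck.
    """
    if target_disks < 1:
        raise ValueError("target_disks must be at least 1")
    total_mb = data_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    seconds = total_mb / (write_mb_per_s * target_disks)
    return seconds / 3600


# Hypothetical cluster: one 4 TB disk failed, 99 disks remain.
single = recovery_hours(data_tb=4, write_mb_per_s=50, target_disks=1)
spread = recovery_hours(data_tb=4, write_mb_per_s=50, target_disks=99)

print(f"one replacement disk: {single:.1f} h")
print(f"99 remaining disks:   {spread:.2f} h")
```

Under these assumptions the speed-up equals the number of disks sharing the writes, which is why the effect grows with cluster size and matters more as individual disks get bigger.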