From 83030b921dd83a84a2e966c88156e64d30fb9c24 Mon Sep 17 00:00:00 2001
From: Christian Schwede
Date: Wed, 29 Oct 2014 10:34:53 +0000
Subject: [PATCH] Update admin guide on handling drive failures

Simply replacing a failed disk requires a very long time if the ring is
not changed, because all data will be replicated to a single new disk.
This extends the time to recover from missing replicas, and becomes even
more important with bigger disks.

This patch updates the doc to include a faster alternative by setting
the weight of a failed disk to 0. In this case the partitions from the
failed disk are distributed and replicated to the remaining disks in
the cluster, and because each disk gets only a fraction of the
partitions it's also much faster.

Change-Id: I16617756359771ad89ca5d4690b58a014f481d9b
---
 doc/source/admin_guide.rst | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/doc/source/admin_guide.rst b/doc/source/admin_guide.rst
index 9c74e0a1e0..78c5b2bec7 100644
--- a/doc/source/admin_guide.rst
+++ b/doc/source/admin_guide.rst
@@ -145,9 +145,20 @@ then it is just best to replace the drive, format it, remount it, and let
 replication fill it up.
 
 If the drive can't be replaced immediately, then it is best to leave it
-unmounted, and remove the drive from the ring. This will allow all the
+unmounted, and set the device weight to 0. This will allow all the
 replicas that were on that drive to be replicated elsewhere until the drive
-is replaced. Once the drive is replaced, it can be re-added to the ring.
+is replaced. Once the drive is replaced, the device weight can be increased
+again. Setting the device weight to 0 instead of removing the drive from the
+ring gives Swift the chance to replicate data from the failing disk too (in case
+it is still possible to read some of the data).
+
+Setting the device weight to 0 (or removing a failed drive from the ring) has
+another benefit: all partitions that were stored on the failed drive are
+distributed over the remaining disks in the cluster, and each disk only needs to
+store a few new partitions. This is much faster compared to replicating all
+partitions to a single, new disk. It decreases the time to recover from a
+degraded number of replicas significantly, and becomes more and more important
+with bigger disks.
 
 -----------------------
 Handling Server Failure
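Note (not part of the patch): the workflow the new doc text describes could be
sketched roughly as follows on a node holding the ring builder files. The
builder file name, device search value `d100`, and the restored weight `100.0`
are illustrative placeholders, not values taken from this patch:

```shell
# Drop the failed device's weight to 0 so its partitions are spread
# across the remaining disks at the next rebalance. "d100" is a
# hypothetical device id; adjust the search value for your ring.
swift-ring-builder object.builder set_weight d100 0
swift-ring-builder object.builder rebalance

# Distribute the updated object.ring.gz to all nodes (mechanism varies
# per deployment, e.g. rsync or configuration management).

# Once the drive has been physically replaced, restore its weight and
# rebalance again so it gradually takes partitions back.
swift-ring-builder object.builder set_weight d100 100.0
swift-ring-builder object.builder rebalance
```

Rebalances are subject to the ring's min_part_hours setting, so partitions
move back gradually rather than all at once.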