Merge "Ring Overview updates"

This commit is contained in:
Jenkins 2016-01-26 00:42:35 +00:00 committed by Gerrit Code Review
commit 08f3abcec2


@ -4,9 +4,9 @@ The Rings
The rings determine where data should reside in the cluster. There is a
separate ring for account databases, container databases, and individual
object storage policies but each ring works in the same way. These rings are
externally managed, in that the server processes themselves do not modify the
rings, they are instead given new rings modified by other tools.

The ring uses a configurable number of bits from a path's MD5 hash as a
partition index that designates a device. The number of bits kept from the hash
@ -18,10 +18,25 @@ cluster all at once.
Another configurable value is the replica count, which indicates how many of
the partition->device assignments comprise a single ring. For a given partition
number, each replica will be assigned to a different device in the ring.

Devices are added to the ring to describe the capacity available for
part-replica assignment. Devices are placed into failure domains consisting
of region, zone, and server. Regions can be used to describe geographically
distributed systems characterized by lower bandwidth or higher latency between
machines in different regions. Many rings will consist of only a single
region. Zones can be used to group devices based on physical locations, power
separations, network separations, or any other attribute that would lessen
multiple replicas being unavailable at the same time.

Devices are given a weight which describes the relative weight of the device
in comparison to other devices.

When building a ring, all of each part's replicas will be assigned to devices
according to their weight. Additionally, each replica of a part will attempt
to be assigned to a device whose failure domain does not already have a
replica for the part. Only a single replica of a part may be assigned to each
device - you must have at least as many devices as replicas.
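
As a rough sketch of how such a topology might be described in code (this
example is not from the document; it assumes the
``swift.common.ring.RingBuilder`` Python API and uses illustrative IP, port,
and weight values)::

    from swift.common.ring import RingBuilder

    # part_power=10 (2**10 = 1024 partitions), 3 replicas, min_part_hours=1
    builder = RingBuilder(10, 3, 1)

    # Each device is placed in a failure domain (region/zone/server) and
    # given a weight describing its capacity relative to other devices.
    builder.add_dev({'id': 0, 'region': 1, 'zone': 1, 'ip': '10.0.0.1',
                     'port': 6200, 'device': 'sda', 'weight': 100})
    builder.add_dev({'id': 1, 'region': 1, 'zone': 2, 'ip': '10.0.0.2',
                     'port': 6200, 'device': 'sda', 'weight': 100})
    builder.add_dev({'id': 2, 'region': 2, 'zone': 1, 'ip': '10.0.1.1',
                     'port': 6200, 'device': 'sda', 'weight': 200})

    # Assign every part-replica to a device according to weight while
    # spreading replicas across failure domains.
    builder.rebalance()
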
------------
Ring Builder
@ -91,8 +106,7 @@ Note: The list of devices may contain holes, or indexes set to None, for
devices that have been removed from the cluster. Generally, device ids are not
reused. Also, some devices may be temporarily disabled by setting their weight
to 0.0. To obtain a list of active devices (for uptime polling, for example)
the Python code would look like: ``devices = list(self._iter_devs())``
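
The ``_iter_devs`` helper itself is not shown in this document; a minimal
sketch of the behaviour it implies, assuming ``self.devs`` is a list that may
contain ``None`` holes, might be::

    def _iter_devs(self):
        # Yield every device entry, skipping the holes (indexes set to
        # None) left behind by removed devices.
        for dev in self.devs:
            if dev is not None:
                yield dev
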
*************************
Partition Assignment List
@ -108,14 +122,24 @@ So, to create a list of device dictionaries assigned to a partition, the Python
code would look like: ``devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]``

That code is a little simplistic, as it does not account for the
removal of duplicate devices. If a ring has more replicas than
devices, then a partition will have more than one replica on one
device; that's simply the pigeonhole principle at work.
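
As an illustrative (not original) follow-up, one way to drop those duplicates,
assuming each device dict carries a unique ``'id'`` key, is::

    # Keep only one entry per device id, preserving order of first use.
    seen_ids = set()
    unique_devices = []
    for dev in devices:
        if dev['id'] not in seen_ids:
            seen_ids.add(dev['id'])
            unique_devices.append(dev)
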
array('H') is used for memory conservation as there may be millions of
partitions.

*********************
Partition Shift Value
*********************
The partition shift value is known internally to the Ring class as _part_shift.
This value is used to shift an MD5 hash of a path to calculate the partition
on which the data for that path should reside. Only the top four bytes of the
hash are used
in this process. For example, to compute the partition for the path
/account/container/object the Python code might look like: ``partition =
unpack_from('>I', md5('/account/container/object').digest())[0] >>
self._part_shift``
For a ring generated with part_power P, the partition shift value is
32 - P.
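
A self-contained sketch of that calculation (the path and part power below
are arbitrary example values, not values from this document)::

    from hashlib import md5
    from struct import unpack_from

    part_power = 10               # example: 2**10 = 1024 partitions
    part_shift = 32 - part_power  # 22

    digest = md5(b'/account/container/object').digest()
    partition = unpack_from('>I', digest)[0] >> part_shift
    assert 0 <= partition < 2 ** part_power
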
*******************
Fractional Replicas
*******************
@ -130,6 +154,21 @@ for the ring. This means that some partitions will have more replicas than
others. For example, if a ring has 3.25 replicas, then 25% of its partitions
will have four replicas, while the remaining 75% will have just three.
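
A short arithmetic sketch of that split (the part power here is an arbitrary
assumption)::

    part_power = 10
    replica_count = 3.25

    partitions = 2 ** part_power                   # 1024
    extra = int(partitions * (replica_count % 1))  # 256
    # 256 partitions get a fourth replica; the other 768 get three.
    print(extra, partitions - extra)
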
**********
Dispersion
**********
With each rebalance, the ring builder calculates a dispersion metric. This is
the percentage of partitions in the ring that have too many replicas within a
particular failure domain.
For example, if you have three servers in a cluster but two replicas for a
partition get placed onto the same server, that partition will count towards
the dispersion metric.
A lower dispersion value is better, and the value can be used to find the
proper value for "overload".
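
A rough, hypothetical sketch of the idea (toy data and simplified counting,
not the actual builder code)::

    from collections import Counter

    # Toy assignment: rows are replicas, columns are partitions, values
    # are the servers holding each part-replica (3 replicas, 4 parts).
    assignment = [
        ['s1', 's1', 's2', 's3'],
        ['s2', 's2', 's3', 's1'],
        ['s3', 's2', 's1', 's2'],
    ]

    num_parts = len(assignment[0])
    bad_parts = 0
    for part in range(num_parts):
        placements = Counter(row[part] for row in assignment)
        # More than one replica on a single server counts against
        # dispersion (there are enough servers here to avoid it).
        if any(count > 1 for count in placements.values()):
            bad_parts += 1

    dispersion = 100.0 * bad_parts / num_parts   # 25.0 for this toy data
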
********
Overload
********
@ -168,74 +207,118 @@ on them than the disks in nodes A and B. If 80% full is the warning
threshold for the cluster, node C's disks will reach 80% full while A
and B's disks are only 72.7% full.

-------------------------------
Partition & Replica Terminology
-------------------------------

All descriptions of consistent hashing describe the process of breaking the
keyspace up into multiple ranges (vnodes, buckets, etc.) - many more than the
number of "nodes" to which keys in the keyspace must be assigned. Swift calls
these ranges `partitions` - they are partitions of the total keyspace.

Each partition will have multiple replicas. Every replica of each partition
must be assigned to a device in the ring. When describing a specific replica
of a partition (such as when it is assigned to a device), it is described as a
`part-replica` in that it is a specific `replica` of the specific `partition`.
A single device may be assigned different replicas from many partitions, but
it may not be assigned multiple replicas of a single partition.

The total number of partitions in a ring is calculated as ``2 **
<part-power>``. The total number of part-replicas in a ring is calculated as
``<replica-count> * 2 ** <part-power>``.
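
For example, with an assumed part power of 16 and a replica count of 3
(arbitrary illustrative values)::

    part_power = 16
    replica_count = 3

    partitions = 2 ** part_power                 # 65536
    part_replicas = replica_count * partitions   # 196608
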
When considering a device's `weight` it is useful to describe the number of
part-replicas it would like to be assigned. A single device, regardless of
weight, will never hold more than ``2 ** <part-power>`` part-replicas because
it can not have more than one replica of any partition assigned. The number
of part-replicas a device can take by weight is calculated as its
`parts_wanted`. The true number of part-replicas assigned to a device can be
compared to its parts wanted similarly to a calculation of percentage error -
this deviation in the observed result from the idealized target is called the
device's `balance`.
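
A hedged sketch of that percentage-error style comparison (the function and
variable names are illustrative, not the builder's actual attributes)::

    def device_balance(parts_assigned, parts_wanted):
        # Deviation of the actual assignment from the weighted target,
        # as a percentage; 0.0 means perfectly balanced.
        if parts_wanted <= 0:
            # A zero-weight device wants nothing; any assignment at all
            # is treated as maximally out of balance in this sketch.
            return float('inf') if parts_assigned else 0.0
        return 100.0 * (parts_assigned - parts_wanted) / parts_wanted

    print(device_balance(105, 100))   # 5.0: holding 5% more than wanted
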
When considering a device's `failure domain` it is useful to describe the
number of part-replicas it would like to be assigned. The number of
part-replicas wanted in a failure domain of a tier is the sum of the
part-replicas wanted in the failure domains of its sub-tier. However, when
the total number of part-replicas in a failure domain equals or exceeds
``2 ** <part-power>`` it is no longer sufficient to consider only the number
of total part-replicas; rather, we must consider the fraction of each
replica's partitions. Consider for example a ring with ``3`` replicas and
``3`` servers: while it is necessary for dispersion that each server hold
only ``1/3`` of the total part-replicas, each server is additionally
constrained to require ``1.0`` replica of *each* partition. It would not be
sufficient to satisfy dispersion if two devices on one of the servers each
held a replica of a single partition, while another server held none. By
considering a decimal fraction of one replica's worth of partitions in a
failure domain we can derive the total part-replicas wanted in a failure
domain (``1.0 * 2 ** <part-power>``). Additionally we infer more about
`which` part-replicas must go in the failure domain.

Consider a ring with three replicas and two zones, each with two servers
(four servers total). The three replicas' worth of partitions will be
assigned into two failure domains at the zone tier. Each zone must hold more
than one replica of some partitions. We represent this improper fraction of a
replica's worth of partitions in decimal form as ``1.5`` (``3.0 / 2``). This
tells us not only the *number* of total partitions (``1.5 * 2 **
<part-power>``) but also that *each* partition must have `at least` one
replica in this failure domain (in fact ``0.5`` of the partitions will have
``2`` replicas). Within each zone the two servers will hold ``0.75`` of a
replica's worth of partitions - this is equal both to "the fraction of a
replica's worth of partitions assigned to each zone (``1.5``) divided evenly
among the number of failure domains in its sub-tier (``2`` servers in each
zone, i.e. ``1.5 / 2``)" but *also* to "the total number of replicas
(``3.0``) divided evenly among the total number of failure domains in the
server tier (``2`` servers x ``2`` zones = ``4``, i.e. ``3.0 / 4``)". It is
useful to consider that each server in this ring will hold only ``0.75`` of a
replica's worth of partitions, which tells us that any server should have `at
most` one replica of a given partition assigned.

In the interests of brevity, some variable names will often refer to this
fraction of a replica's worth of partitions in decimal form as *replicanths*
- this is meant to invoke connotations similar to ordinal numbers as applied
to fractions, but generalized to a replica instead of a four*th* or a
fif*th*. The 'n' was probably thrown in because of Blade Runner.
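
A short numeric sketch of the worked example above (three replicas, two
zones, two servers per zone; the part power is an arbitrary choice)::

    replicas = 3.0
    zones = 2
    servers_per_zone = 2

    zone_replicanths = replicas / zones                        # 1.5
    server_replicanths = zone_replicanths / servers_per_zone   # 0.75
    # Equivalently: total replicas divided among all server failure
    # domains in the tier (2 zones x 2 servers = 4).
    assert server_replicanths == replicas / (zones * servers_per_zone)

    part_power = 10
    parts_wanted_in_zone = zone_replicanths * 2 ** part_power  # 1536.0
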
-----------------
Building the Ring
-----------------

First the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on weight.

Then the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on dispersion.

Then the ring builder calculates the maximum deviation on a single device
between its weighted replicanths and wanted replicanths.

Next we interpolate between the two replicanth values (weighted & wanted) at
each tier using the specified overload (up to the maximum required overload).
It's a linear interpolation, similar to solving for a point on a line between
two points - we calculate the slope across the max required overload and then
calculate the intersection of the line with the desired overload. This
becomes the target.
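
A hedged sketch of that interpolation (illustrative names and values, not the
builder's internals)::

    def interpolated_replicanths(weighted, wanted, overload, max_overload):
        # Move from the weight-based value toward the dispersion-based
        # value as overload approaches the maximum required overload.
        if max_overload <= 0:
            return weighted
        fraction = min(overload, max_overload) / max_overload
        return weighted + (wanted - weighted) * fraction

    # A tier that wants 0.75 replicanths by weight but 1.0 for dispersion,
    # with a configured overload of 0.1 and a required maximum of 0.25:
    target = interpolated_replicanths(0.75, 1.0, 0.1, 0.25)   # 0.85
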
From the target we calculate the minimum and maximum number of replicas any
part may have in a tier. This becomes the replica_plan.

Finally, we calculate the number of partitions that should ideally be assigned
to each device based on the replica_plan.

On initial balance (the first time partitions are placed to generate a ring)
we must assign each replica of each partition to the device that desires the
most partitions, excluding any devices that already have their maximum number
of replicas of that partition assigned within some parent tier of that
device's failure domain.

When building a new ring based on an old ring, the desired number of
partitions each device wants is recalculated from the current replica_plan.
Next the partitions to be reassigned are gathered up. Any removed devices have
all their assigned partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new devices) can be spread out
for better durability are unassigned and added to the gathered list. Any
devices that have more partitions than they now desire have random partitions
unassigned from them and added to the gathered list. Lastly, the gathered
partitions are then reassigned to devices using a similar method as in the
initial assignment described above.
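
A high-level, purely illustrative sketch of that gather-and-reassign pass
(every helper name below is hypothetical, not an actual builder method)::

    import random

    def rebalance_pass(builder):
        # Gather part-replicas that need to move, then hand each one to
        # the device that wants parts the most without violating the
        # replica_plan for its failure domains.
        gathered = []
        gathered.extend(builder.replicas_on_removed_devices())
        gathered.extend(builder.replicas_spreadable_for_dispersion())

        for dev in builder.overweight_devices():
            surplus = dev['parts_assigned'] - dev['parts_wanted']
            gathered.extend(random.sample(dev['part_replicas'], surplus))

        for part_replica in gathered:
            builder.assign_to_most_wanting_device(part_replica)
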
Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign so
@ -247,10 +330,9 @@ failure and there's no choice but to make a reassignment.
The above processes don't always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more balanced
ring, the rebalance process is repeated up to a fixed number of times, until
the replica_plan is fulfilled or cannot be fulfilled (indicating we probably
can't get perfect balance due to too many partitions recently moved).
---------------------
Ring Builder Analyzer
@ -263,8 +345,8 @@ History
-------

The ring code went through many iterations before arriving at what it is now
and, while it has largely been stable, the algorithm has seen a few tweaks and
may yet be fundamentally changed as new ideas emerge. This section will try
to describe the previous ideas attempted and attempt to explain why they were
discarded.
@ -329,15 +411,14 @@ be maintaining the rings themselves anyway and only doing hash lookups, MD5 was
chosen for its general availability, good distribution, and adequate speed.

The placement algorithm has seen a number of behavioral changes for
unbalanceable rings. The ring builder wants to keep replicas as far apart as
possible while still respecting device weights. In most cases, the ring
builder can achieve both, but sometimes they conflict. At first, the behavior
was to keep the replicas far apart and ignore device weight, but that made it
impossible to gradually go from one region to two, or from two to three. Then
it was changed to favor device weight over dispersion, but that wasn't so good
for rings that were close to balanceable, like 3 machines with 60TB, 60TB, and
57TB of disk space; operators were expecting one replica per machine, but
didn't always get it. After that, overload was added to the ring builder so
that operators could choose a balance between dispersion and device weights.
In time the overload concept was improved and made more accurate.