Add notion of overload to swift-ring-builder

The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.

Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.

Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.

If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.

This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.

For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.

Overload only has an effect when replica-dispersion and device weights
come into conflict.

The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.

In the example above, imagine the operator sets an overload of 0.112
on their rings. If node C loses a drive, each other drive can take on
up to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.

DocImpact

Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
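
The percentages in the example above can be checked with a short sketch. The helper name `extra_load_fraction` is hypothetical and not part of Swift; it only models the assumption that a failed disk's partitions are split evenly among the surviving disks in the same node.

```python
def extra_load_fraction(total_disks, failed_disks):
    """Fractional load increase on each surviving disk when the failed
    disks' partitions are split evenly among the survivors."""
    surviving = total_disks - failed_disks
    return failed_disks / float(surviving)


overload = 0.112  # value used in the commit message example

for failed in (1, 2, 3):
    extra = extra_load_fraction(20, failed)
    # The node keeps all of its replicas only while the extra load
    # still fits within the overload.
    print("%d failed: +%.2f%% per disk, fits within overload: %s"
          % (failed, 100 * extra, extra <= overload))
```

With three failed disks the increase is 3/17, which is well above 11.2%, so that is the point at which some partitions would have to leave node C.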
parent 4cdb51418c
commit bcf26f5209
@@ -130,6 +130,43 @@ for the ring. This means that some partitions will have more replicas than
 others. For example, if a ring has 3.25 replicas, then 25% of its partitions
 will have four replicas, while the remaining 75% will have just three.
 
+********
+Overload
+********
+
+The ring builder tries to keep replicas as far apart as possible while
+still respecting device weights. When it can't do both, the overload
+factor determines what happens. Each device will take some extra
+fraction of its desired partitions to allow for replica dispersion;
+once that extra fraction is exhausted, replicas will be placed closer
+together than optimal.
+
+Essentially, the overload factor lets the operator trade off replica
+dispersion (durability) against data dispersion (uniform disk usage).
+
+The default overload factor is 0, so device weights will be strictly
+followed.
+
+With an overload factor of 0.1, each device will accept 10% more
+partitions than it otherwise would, but only if needed to maintain
+partition dispersion.
+
+Example: Consider a 3-node cluster of machines with equal-size disks;
+let node A have 12 disks, node B have 12 disks, and node C have only
+11 disks. Let the ring have an overload factor of 0.1 (10%).
+
+Without the overload, some partitions would end up with replicas only
+on nodes A and B. However, with the overload, every device is willing
+to accept up to 10% more partitions for the sake of dispersion. The
+missing disk in C means there is one disk's worth of partitions that
+would like to spread across the remaining 11 disks, which gives each
+disk in C an extra 9.09% load. Since this is less than the 10%
+overload, there is one replica of each partition on each node.
+
+However, this does mean that the disks in node C will have more data
+on them than the disks in nodes A and B. If 80% full is the warning
+threshold for the cluster, node C's disks will reach 80% full while A
+and B's disks are only 72.7% full.
+
 *********************
 Partition Shift Value
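
The 12/12/11-disk example in the documentation hunk above can be worked through numerically. This is only an illustrative sketch: the variable names are made up here, and it assumes each disk fills in proportion to the number of partitions it holds.

```python
disks_a = disks_b = 12
disks_c = 11
overload = 0.1

# One disk's worth of partitions is spread over node C's 11 disks.
extra_per_disk = (disks_a - disks_c) / float(disks_c)
print("extra load per disk in C: %.2f%%" % (100 * extra_per_disk))
print("fits within overload: %s" % (extra_per_disk <= overload))

# Fill level of A's and B's disks when C's disks hit the 80% threshold.
c_fill = 0.80
ab_fill_needed = c_fill / (1 + extra_per_disk)  # using the 9.09% actually needed
ab_fill_limit = c_fill / (1 + overload)         # if the full 10% were used
print("A/B fill: %.1f%% (needed), %.1f%% (at the overload limit)"
      % (100 * ab_fill_needed, 100 * ab_fill_limit))
```

Under these assumptions, the 72.7% figure quoted in the documentation text corresponds to the full 10% overload being used; with the 9.09% that the example actually requires, the same calculation gives roughly 73.3%.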
@@ -269,3 +306,17 @@ faster, but MD5 was built-in and hash computation is a small percentage of the
 overall request handling time. In all, once it was decided the servers wouldn't
 be maintaining the rings themselves anyway and only doing hash lookups, MD5 was
 chosen for its general availability, good distribution, and adequate speed.
+
+The placement algorithm has seen a number of behavioral changes for
+unbalanceable rings. The ring builder wants to keep replicas as far
+apart as possible while still respecting device weights. In most
+cases, the ring builder can achieve both, but sometimes they conflict.
+At first, the behavior was to keep the replicas far apart and ignore
+device weight, but that made it impossible to gradually go from one
+region to two, or from two to three. Then it was changed to favor
+device weight over dispersion, but that wasn't so good for rings that
+were close to balanceable, like 3 machines with 60TB, 60TB, and 57TB
+of disk space; operators were expecting one replica per machine, but
+didn't always get it. After that, overload was added to the ring
+builder so that operators could choose a balance between dispersion
+and device weights.
@@ -251,6 +251,7 @@ swift-ring-builder <builder_file>
                                              balance)
         print 'The minimum number of hours before a partition can be ' \
               'reassigned is %s' % builder.min_part_hours
+        print 'The overload factor is %.6f' % builder.overload
         if builder.devs:
             print 'Devices: id region zone ip address port ' \
                   'replication ip replication port name ' \
@@ -650,7 +651,7 @@ swift-ring-builder <builder_file> rebalance <seed>
         print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
             (parts, 100.0 * parts / builder.parts, balance)
         status = EXIT_SUCCESS
-        if balance > 5:
+        if balance > 5 and balance / 100.0 > builder.overload:
             print '-' * 79
             print 'NOTE: Balance of %.02f indicates you should push this ' % \
                 balance
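
The changed condition in the hunk above compares a balance expressed in percent against an overload expressed as a fraction, which is why the balance is divided by 100.0. A minimal sketch of that check follows; the function name `should_warn` is hypothetical and only wraps the expression from the diff.

```python
def should_warn(balance, overload):
    """Return True when the rebalance/repush note would be printed.

    balance:  ring balance in percent (e.g. 8.0 means 8%)
    overload: overload factor as a fraction (e.g. 0.12 means 12%)
    """
    return balance > 5 and balance / 100.0 > overload


print(should_warn(8.0, 0.0))    # imbalance above 5% with no overload configured
print(should_warn(8.0, 0.12))   # same imbalance, but within a 12% overload
print(should_warn(20.0, 0.12))  # imbalance exceeds the overload
```

The effect is that an imbalance the operator has explicitly allowed through the overload factor no longer triggers the note.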
@@ -794,6 +795,35 @@ swift-ring-builder <builder_file> set_replicas <replicas>
         builder.save(argv[1])
         exit(EXIT_SUCCESS)
 
+    def set_overload():
+        """
+swift-ring-builder <builder_file> set_overload <overload>
+    Changes the overload factor to the given <overload>.
+
+    A rebalance is needed to make the change take effect.
+        """
+        if len(argv) < 4:
+            print Commands.set_overload.__doc__.strip()
+            exit(EXIT_ERROR)
+
+        new_overload = argv[3]
+        try:
+            new_overload = float(new_overload)
+        except ValueError:
+            print Commands.set_overload.__doc__.strip()
+            print "\"%s\" is not a valid number." % new_overload
+            exit(EXIT_ERROR)
+
+        if new_overload < 0:
+            print "Overload must be non-negative."
+            exit(EXIT_ERROR)
+
+        builder.set_overload(new_overload)
+        print 'The overload is now %.6f.' % builder.overload
+        print 'The change will take effect after the next rebalance.'
+        builder.save(argv[1])
+        exit(EXIT_SUCCESS)
+
 
 def main(arguments=None):
     global argv, backup_dir, builder, builder_file, ring_file
@@ -66,6 +66,7 @@ class RingBuilder(object):
         self.devs = []
         self.devs_changed = False
         self.version = 0
+        self.overload = 0.0
 
         # _replica2part2dev maps from replica number to partition number to
         # device id. So, for a three replica, 2**23 ring, it's an array of
@@ -122,6 +123,7 @@ class RingBuilder(object):
             self.parts = builder.parts
             self.devs = builder.devs
             self.devs_changed = builder.devs_changed
+            self.overload = builder.overload
             self.version = builder.version
             self._replica2part2dev = builder._replica2part2dev
             self._last_part_moves_epoch = builder._last_part_moves_epoch
@@ -135,6 +137,7 @@ class RingBuilder(object):
             self.parts = builder['parts']
             self.devs = builder['devs']
             self.devs_changed = builder['devs_changed']
+            self.overload = builder.get('overload', 0.0)
             self.version = builder['version']
             self._replica2part2dev = builder['_replica2part2dev']
             self._last_part_moves_epoch = builder['_last_part_moves_epoch']
@@ -162,6 +165,7 @@ class RingBuilder(object):
                 'devs': self.devs,
                 'devs_changed': self.devs_changed,
                 'version': self.version,
+                'overload': self.overload,
                 '_replica2part2dev': self._replica2part2dev,
                 '_last_part_moves_epoch': self._last_part_moves_epoch,
                 '_last_part_moves': self._last_part_moves,
@@ -202,6 +206,9 @@ class RingBuilder(object):
 
         self.replicas = new_replica_count
 
+    def set_overload(self, overload):
+        self.overload = overload
+
     def get_ring(self):
         """
         Get the ring, or more specifically, the swift.common.ring.RingData.
@@ -545,8 +552,8 @@ class RingBuilder(object):
                 # the last would not, probably resulting in a crash. This
                 # way, some devices end up with leftover parts_wanted, but
                 # at least every partition ends up somewhere.
-                int(math.ceil(weight_of_one_part * dev['weight'])) -
-                dev['parts'])
+                int(math.ceil(weight_of_one_part * dev['weight']
+                              - dev['parts'])))
 
     def _adjust_replica2part2dev_size(self):
         """
@@ -655,10 +662,12 @@ class RingBuilder(object):
         """
         wanted_parts_for_tier = {}
         for dev in self._iter_devs():
-            pw = max(0, dev['parts_wanted'])
+            pw = (max(0, dev['parts_wanted']) +
+                  max(int(math.ceil(
+                      (dev['parts_wanted'] + dev['parts']) * self.overload)),
+                      0))
             for tier in tiers_for_dev(dev):
-                if tier not in wanted_parts_for_tier:
-                    wanted_parts_for_tier[tier] = 0
+                wanted_parts_for_tier.setdefault(tier, 0)
                 wanted_parts_for_tier[tier] += pw
         return wanted_parts_for_tier
 
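
The new `pw` expression in the hunk above can be read as "what the device wants by weight, plus an overload allowance computed from its desired total". The sketch below restates it for a single device; the function name `parts_wanted_with_overload` and its arguments are hypothetical stand-ins for the `dev` dictionary fields and `self.overload`.

```python
import math


def parts_wanted_with_overload(parts_wanted, parts, overload):
    """Restates the pw expression from the diff for one device.

    parts_wanted: partitions the device still wants by weight
                  (negative when it should shed partitions)
    parts:        partitions the device currently holds
    overload:     overload factor as a fraction
    """
    by_weight = max(0, parts_wanted)
    # parts_wanted + parts is the device's desired total by weight;
    # the allowance is that total scaled by the overload, rounded up.
    allowance = max(int(math.ceil((parts_wanted + parts) * overload)), 0)
    return by_weight + allowance


# A device already holding its weighted share of 192 partitions:
print(parts_wanted_with_overload(0, 192, 0.1))
# With no overload, a device that wants nothing more accepts nothing more:
print(parts_wanted_with_overload(-10, 100, 0.0))
```

For a device at its weighted share of 192 partitions and an overload of 0.1, the allowance is 20 partitions. That is consistent with the 212 partitions that `test_overload` in this commit expects for devices 0 and 1, although the real placement depends on the rest of the algorithm as well.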
@@ -847,24 +856,30 @@ class RingBuilder(object):
             replicas_to_replace may be shared for multiple
             partitions, so be sure you do not modify it.
         """
+        fudge_available_in_tier = defaultdict(int)
         parts_available_in_tier = defaultdict(int)
         for dev in self._iter_devs():
             dev['sort_key'] = self._sort_key_for(dev)
             tiers = tiers_for_dev(dev)
             dev['tiers'] = tiers
+            # Note: this represents how many partitions may be assigned to a
+            # given tier (region/zone/server/disk). It does not take into
+            # account how many partitions a given tier wants to shed.
+            #
+            # If we did not do this, we could have a zone where, at some
+            # point during assignment, number-of-parts-to-gain equals
+            # number-of-parts-to-shed. At that point, no further placement
+            # into that zone would occur since its parts_available_in_tier
+            # would be 0. This would happen any time a zone had any device
+            # with partitions to shed, which is any time a device is being
+            # removed, which is a pretty frequent operation.
+            wanted = max(dev['parts_wanted'], 0)
+            fudge = max(int(math.ceil(
+                (dev['parts_wanted'] + dev['parts']) * self.overload)),
+                0)
             for tier in tiers:
-                # Note: this represents how many partitions may be assigned to
-                # a given tier (region/zone/server/disk). It does not take
-                # into account how many partitions a given tier wants to shed.
-                #
-                # If we did not do this, we could have a zone where, at some
-                # point during assignment, number-of-parts-to-gain equals
-                # number-of-parts-to-shed. At that point, no further placement
-                # into that zone would occur since its parts_available_in_tier
-                # would be 0. This would happen any time a zone had any device
-                # with partitions to shed, which is any time a device is being
-                # removed, which is a pretty frequent operation.
-                parts_available_in_tier[tier] += max(dev['parts_wanted'], 0)
+                fudge_available_in_tier[tier] += (wanted + fudge)
+                parts_available_in_tier[tier] += wanted
 
         available_devs = \
             sorted((d for d in self._iter_devs() if d['weight']),
@@ -916,6 +931,7 @@ class RingBuilder(object):
                 tier = ()
                 depth = 1
                 while depth <= max_tier_depth:
+                    roomiest_tier = fudgiest_tier = None
                     # Order the tiers by how many replicas of this
                     # partition they already have. Then, of the ones
                     # with the smallest number of replicas and that have
@@ -954,22 +970,43 @@ class RingBuilder(object):
                     candidates_with_room = [
                         t for t in tier2children[tier]
                         if parts_available_in_tier[t] > 0]
+                    candidates_with_fudge = set([
+                        t for t in tier2children[tier]
+                        if fudge_available_in_tier[t] > 0])
+                    candidates_with_fudge.update(candidates_with_room)
+
-                    if len(candidates_with_room) > \
-                            len(candidates_with_replicas):
-                        # There exists at least one tier with room for
-                        # another partition and 0 other replicas already
-                        # in it, so we can use a faster search. The else
-                        # branch's search would work here, but it's
-                        # significantly slower.
-                        tier = max((t for t in candidates_with_room
-                                    if other_replicas[t] == 0),
-                                   key=tier2sort_key.__getitem__)
+                    if candidates_with_room:
+                        if len(candidates_with_room) > \
+                                len(candidates_with_replicas):
+                            # There exists at least one tier with room for
+                            # another partition and 0 other replicas already in
+                            # it, so we can use a faster search. The else
+                            # branch's search would work here, but it's
+                            # significantly slower.
+                            roomiest_tier = max(
+                                (t for t in candidates_with_room
+                                 if other_replicas[t] == 0),
+                                key=tier2sort_key.__getitem__)
+                        else:
+                            roomiest_tier = max(
+                                candidates_with_room,
+                                key=lambda t: (-other_replicas[t],
+                                               tier2sort_key[t]))
                     else:
-                        tier = max(candidates_with_room,
-                                   key=lambda t: (-other_replicas[t],
-                                                  tier2sort_key[t]))
+                        roomiest_tier = None
+
+                    fudgiest_tier = max(candidates_with_fudge,
+                                        key=lambda t: (-other_replicas[t],
+                                                       tier2sort_key[t]))
+
+                    if (roomiest_tier is None or
+                            (other_replicas[roomiest_tier] >
+                             other_replicas[fudgiest_tier])):
+                        tier = fudgiest_tier
+                    else:
+                        tier = roomiest_tier
                     depth += 1
 
                 dev = tier2devs[tier][-1]
                 dev['parts_wanted'] -= 1
                 dev['parts'] += 1
@@ -977,6 +1014,7 @@ class RingBuilder(object):
                 new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
                 for tier in dev['tiers']:
                     parts_available_in_tier[tier] -= 1
+                    fudge_available_in_tier[tier] -= 1
                     other_replicas[tier] += 1
                     occupied_tiers_by_tier_len[len(tier)].add(tier)
 
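
The tier selection added in the hunks above chooses between the best tier that has room by weight alone (`roomiest_tier`) and the best tier that has room once the overload allowance is counted (`fudgiest_tier`). The sketch below is a simplified restatement under stated assumptions: the function name `choose_tier` is hypothetical, the faster search used when an empty tier exists is left out, and plain integers stand in for the builder's sort keys.

```python
def choose_tier(candidates_with_room, candidates_with_fudge,
                other_replicas, sort_key):
    """Pick a tier for the next replica of a partition.

    candidates_with_room:  tiers that still want partitions by weight
    candidates_with_fudge: tiers with room once overload is counted
    other_replicas:        tier -> replicas of this partition already there
    sort_key:              tier -> comparable key, larger is preferred
    """
    def rank(tier):
        # Fewer existing replicas wins; the sort key breaks ties.
        return (-other_replicas[tier], sort_key[tier])

    with_fudge = set(candidates_with_fudge)
    with_fudge.update(candidates_with_room)

    roomiest = max(candidates_with_room, key=rank) \
        if candidates_with_room else None
    # Like the code in the diff, this raises ValueError if no tier has
    # any room at all.
    fudgiest = max(with_fudge, key=rank)

    # Use the overload allowance only when it gives better dispersion.
    if roomiest is None or \
            other_replicas[roomiest] > other_replicas[fudgiest]:
        return fudgiest
    return roomiest


# Nodes A and B already hold a replica; C is full by weight but still
# has overload allowance, so the replica goes to C.
replicas = {'A': 1, 'B': 1, 'C': 0}
keys = {'A': 5, 'B': 4, 'C': 1}
print(choose_tier(['A', 'B'], ['C'], replicas, keys))
```

When no tier has overload allowance left, the choice falls back to the tiers with room by weight, which is the point at which replicas end up closer together than optimal.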
@@ -15,6 +15,7 @@
 
 import mock
 import os
+import StringIO
 import tempfile
 import unittest
 import uuid
@@ -213,6 +214,27 @@ class TestCommands(unittest.TestCase):
         ring = RingBuilder.load(self.tmpfile)
         self.assertEqual(ring.replicas, 3.14159265359)
 
+    def test_set_overload(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "0.19878"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.19878)
+
+    def test_set_overload_negative(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "-0.19878"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.0)
+
+    def test_set_overload_non_numeric(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "swedish fish"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.0)
+
     def test_validate(self):
         self.create_sample_ring()
         ring = RingBuilder.load(self.tmpfile)
@@ -273,5 +295,81 @@ class TestCommands(unittest.TestCase):
         except SystemExit as e:
             self.assertEquals(e.code, 2)
 
+
+class TestRebalanceCommand(unittest.TestCase):
+
+    def __init__(self, *args, **kwargs):
+        super(TestRebalanceCommand, self).__init__(*args, **kwargs)
+        tmpf = tempfile.NamedTemporaryFile()
+        self.tempfile = tmpf.name
+
+    def tearDown(self):
+        try:
+            os.remove(self.tempfile)
+        except OSError:
+            pass
+
+    def run_srb(self, *argv):
+        mock_stdout = StringIO.StringIO()
+        mock_stderr = StringIO.StringIO()
+
+        srb_args = ["", self.tempfile] + [str(s) for s in argv]
+
+        try:
+            with mock.patch("sys.stdout", mock_stdout):
+                with mock.patch("sys.stderr", mock_stderr):
+                    swift.cli.ringbuilder.main(srb_args)
+        except SystemExit as err:
+            if err.code not in (0, 1):  # (success, warning)
+                raise
+        return (mock_stdout.getvalue(), mock_stderr.getvalue())
+
+    def test_rebalance_warning_appears(self):
+        self.run_srb("create", 8, 3, 24)
+        # all in one machine: totally balanceable
+        self.run_srb("add",
+                     "r1z1-10.1.1.1:2345/sda", 100.0,
+                     "r1z1-10.1.1.1:2345/sdb", 100.0,
+                     "r1z1-10.1.1.1:2345/sdc", 100.0,
+                     "r1z1-10.1.1.1:2345/sdd", 100.0)
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+        # 2 machines of equal size: balanceable, but not in one pass due to
+        # min_part_hours > 0
+        self.run_srb("add",
+                     "r1z1-10.1.1.2:2345/sda", 100.0,
+                     "r1z1-10.1.1.2:2345/sdb", 100.0,
+                     "r1z1-10.1.1.2:2345/sdc", 100.0,
+                     "r1z1-10.1.1.2:2345/sdd", 100.0)
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" in out)
+
+        # after two passes, it's all balanced out
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+    def test_rebalance_warning_with_overload(self):
+        self.run_srb("create", 8, 3, 24)
+        self.run_srb("set_overload", 0.12)
+        # The ring's balance is at least 5, so normally we'd get a warning,
+        # but it's suppressed due to the overload factor.
+        self.run_srb("add",
+                     "r1z1-10.1.1.1:2345/sda", 100.0,
+                     "r1z1-10.1.1.1:2345/sdb", 100.0,
+                     "r1z1-10.1.1.1:2345/sdc", 120.0)
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+        # Now we add in a really big device, but not enough partitions move
+        # to fill it in one pass, so we see the rebalance warning.
+        self.run_srb("add", "r1z1-10.1.1.1:2345/sdd", 99999.0)
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" in out)
+
+
 if __name__ == '__main__':
     unittest.main()
@@ -38,6 +38,17 @@ class TestRingBuilder(unittest.TestCase):
     def tearDown(self):
         rmtree(self.testdir, ignore_errors=1)
 
+    def _partition_counts(self, builder):
+        """
+        Returns a dictionary mapping (device ID) to (number of partitions
+        assigned to that device).
+        """
+        counts = {}
+        for part2dev_id in builder._replica2part2dev:
+            for dev_id in part2dev_id:
+                counts[dev_id] = counts.get(dev_id, 0) + 1
+        return counts
+
     def _get_population_by_region(self, builder):
         """
         Returns a dictionary mapping region to number of partitions in that
@@ -984,6 +995,168 @@ class TestRingBuilder(unittest.TestCase):
                                   8: 192, 9: 192,
                                   10: 64, 11: 64})
 
+    def test_overload(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 1,
+                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 1, 'weight': 1,
+                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdb'})
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 2, 'weight': 2,
+                    'ip': '127.0.0.2', 'port': 10002, 'device': 'sdc'})
+        rb.rebalance(seed=12345)
+
+        # sanity check: balance respects weights, so default
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 192)
+        self.assertEqual(part_counts[1], 192)
+        self.assertEqual(part_counts[2], 384)
+
+        # Devices 0 and 1 take 10% more than their fair shares by weight since
+        # overload is 10% (0.1).
+        rb.set_overload(0.1)
+        for _ in range(2):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 212)
+        self.assertEqual(part_counts[1], 212)
+        self.assertEqual(part_counts[2], 344)
+
+        # Now, devices 0 and 1 take 50% more than their fair shares by
+        # weight.
+        rb.set_overload(0.5)
+        for _ in range(3):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 256)
+        self.assertEqual(part_counts[1], 256)
+        self.assertEqual(part_counts[2], 256)
+
+        # Devices 0 and 1 may take up to 75% over their fair share, but the
+        # placement algorithm only wants to spread things out evenly between
+        # all drives, so the devices stay at 50% more.
+        rb.set_overload(0.75)
+        for _ in range(3):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 256)
+        self.assertEqual(part_counts[1], 256)
+        self.assertEqual(part_counts[2], 256)
+
+    def test_overload_keeps_balanceable_things_balanced_initially(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 3, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 4, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 5, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 6, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 7, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 8, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 9, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sdb'})
+
+        rb.set_overload(99999)
+        rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
+    def test_overload_keeps_balanceable_things_balanced_on_rebalance(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 3, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 4, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 5, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 6, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 7, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 8, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 9, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sdb'})
+
+        rb.set_overload(99999)
+
+        rb.rebalance(seed=123)
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
+        # swap weights between 10.0.0.1 and 10.0.0.2
+        rb.set_dev_weight(0, 4)
+        rb.set_dev_weight(1, 4)
+        rb.set_dev_weight(2, 8)
+        rb.set_dev_weight(1, 8)
+
+        rb.rebalance(seed=456)
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
     def test_load(self):
         rb = ring.RingBuilder(8, 3, 1)
         devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
@@ -1099,6 +1272,7 @@ class TestRingBuilder(unittest.TestCase):
                  'ip': '127.0.0.3', 'port': 10003,
                  'replication_ip': '127.0.0.3', 'replication_port': 10003,
                  'device': 'sdd1', 'meta': ''}]
+        rb.set_overload(3.14159)
         for d in devs:
             rb.add_dev(d)
         rb.rebalance()
@@ -1107,6 +1281,7 @@ class TestRingBuilder(unittest.TestCase):
         loaded_rb = ring.RingBuilder.load(builder_file)
         self.maxDiff = None
         self.assertEquals(loaded_rb.to_dict(), rb.to_dict())
+        self.assertEquals(loaded_rb.overload, 3.14159)
 
     @mock.patch('__builtin__.open', autospec=True)
     @mock.patch('swift.common.ring.builder.pickle.dump', autospec=True)