Add notion of overload to swift-ring-builder

The ring builder's placement algorithm has two goals: first, to ensure
that each partition has its replicas as far apart as possible, and
second, to ensure that partitions are fairly distributed according to
device weight. In many cases, it succeeds in both, but sometimes those
goals conflict. When that happens, operators may want to relax the
rules a little bit in order to reach a compromise solution.

Imagine a cluster of 3 nodes (A, B, C), each with 20 identical disks,
and using 3 replicas. The ring builder will place 1 replica of each
partition on each node, as you'd expect.

Now imagine that one disk fails in node C and is removed from the
ring. The operator would probably be okay with remaining at 1 replica
per node (unless their disks are really close to full), but to
accomplish that, they have to multiply the weights of the other disks
in node C by 20/19 to make C's total weight stay the same. Otherwise,
the ring builder will move partitions around such that some partitions
have replicas only on nodes A and B.

If 14 more disks failed in node C, the operator would probably be okay
with some data not living on C, as a 4x increase in storage
requirements is likely to fill disks.

This commit introduces the notion of "overload": how much extra
partition space can be placed on each disk *over* what the weight
dictates.

For example, an overload of 0.1 means that a device can take up to 10%
more partitions than its weight would imply in order to make the
replica dispersion better.
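
As a rough illustration of that arithmetic, the sketch below mirrors the
`ceil((parts_wanted + parts) * overload)` expression this commit adds to
the builder. The helper name `max_parts_with_overload` is hypothetical and
not part of Swift; it only shows the upper bound a device may reach.

```python
import math


def max_parts_with_overload(desired_parts, overload):
    """Upper bound on the partitions a device may accept.

    Hypothetical helper for illustration only.
    desired_parts: partitions the device should hold according to its weight.
    overload: the builder's overload factor, e.g. 0.1 for 10%.
    """
    # Extra headroom is rounded up, as in the builder change.
    extra = max(int(math.ceil(desired_parts * overload)), 0)
    return desired_parts + extra


# A device whose weight entitles it to 192 partitions:
print(max_parts_with_overload(192, 0.0))  # weights followed strictly
print(max_parts_with_overload(192, 0.1))  # up to 10% extra, rounded up
```

The bound is a ceiling, not a target: the extra room is only used when it
improves replica dispersion.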

Overload only has an effect when replica-dispersion and device weights
come into conflict.

The overload is a single floating-point value for the builder
file. Existing builders get an overload of 0.0, so there will be no
behavior change on existing rings.
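
Because older builder files carry no overload entry, loading one has to
fall back to 0.0. A minimal sketch of that fallback, assuming the saved
builder is a plain dict as in the `builder.get('overload', 0.0)` line this
commit adds (the function name `overload_from_saved` and the sample dicts
are made up for illustration):

```python
def overload_from_saved(builder_dict):
    """Return the stored overload, or 0.0 for builders saved before it existed."""
    return builder_dict.get('overload', 0.0)


old_builder = {'version': 3}                   # hypothetical pre-overload builder
new_builder = {'version': 4, 'overload': 0.1}  # hypothetical newer builder

print(overload_from_saved(old_builder))
print(overload_from_saved(new_builder))
```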

In the example above, imagine the operator sets an overload of 0.112
on their rings. If node C loses a drive, each other drive can take on up
to 11.2% more data. Splitting the dead drive's partitions among the
remaining 19 results in a 5.26% increase, so everything that was on
node C stays on node C. If another disk dies, then we're up to an
11.1% increase, and so everything still stays on node C. If a third
disk dies, then we've reached the limits of the overload, so some
partitions will begin to reside solely on nodes A and B.
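
The percentages in this example can be checked with a few lines of
Python. This is only a sketch of the arithmetic, assuming identical disks
and that a failed disk's share is spread evenly over the node's surviving
disks; `extra_load` is a hypothetical helper, not ring-builder code.

```python
def extra_load(total_disks, failed_disks):
    """Fractional extra load on each surviving disk in a node (hypothetical)."""
    remaining = total_disks - failed_disks
    return float(failed_disks) / remaining


OVERLOAD = 0.112  # the overload factor from the example

for failed in (1, 2, 3):
    increase = extra_load(20, failed)
    fits = increase <= OVERLOAD
    print("%d failed: +%.2f%% per disk, within overload: %s"
          % (failed, 100 * increase, fits))
```

With 20 disks, one and two failures stay under the 11.2% limit (1/19 and
2/18), while three failures (3/17) exceed it.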

DocImpact

Change-Id: I3593a1defcd63b6ed8eae9c1c66b9d3428b33864
Samuel Merritt 2014-12-17 13:48:42 -08:00
parent 4cdb51418c
commit bcf26f5209
5 changed files with 420 additions and 28 deletions


@@ -130,6 +130,43 @@ for the ring. This means that some partitions will have more replicas than
 others. For example, if a ring has 3.25 replicas, then 25% of its partitions
 will have four replicas, while the remaining 75% will have just three.
+********
+Overload
+********
+
+The ring builder tries to keep replicas as far apart as possible while
+still respecting device weights. When it can't do both, the overload
+factor determines what happens. Each device will take some extra
+fraction of its desired partitions to allow for replica dispersion;
+once that extra fraction is exhausted, replicas will be placed closer
+together than optimal.
+
+Essentially, the overload factor lets the operator trade off replica
+dispersion (durability) against data dispersion (uniform disk usage).
+
+The default overload factor is 0, so device weights will be strictly
+followed.
+
+With an overload factor of 0.1, each device will accept 10% more
+partitions than it otherwise would, but only if needed to maintain
+replica dispersion.
+
+Example: Consider a 3-node cluster of machines with equal-size disks;
+let node A have 12 disks, node B have 12 disks, and node C have only
+11 disks. Let the ring have an overload factor of 0.1 (10%).
+
+Without the overload, some partitions would end up with replicas only
+on nodes A and B. However, with the overload, every device is willing
+to accept up to 10% more partitions for the sake of dispersion. The
+missing disk in C means there is one disk's worth of partitions that
+would like to spread across the remaining 11 disks, which gives each
+disk in C an extra 9.09% load. Since this is less than the 10%
+overload, there is one replica of each partition on each node.
+
+However, this does mean that the disks in node C will have more data
+on them than the disks in nodes A and B. If 80% full is the warning
+threshold for the cluster, node C's disks will reach 80% full while A
+and B's disks are only 73.3% full (80% x 11/12).
 *********************
 Partition Shift Value
@@ -269,3 +306,17 @@ faster, but MD5 was built-in and hash computation is a small percentage of the
 overall request handling time. In all, once it was decided the servers wouldn't
 be maintaining the rings themselves anyway and only doing hash lookups, MD5 was
 chosen for its general availability, good distribution, and adequate speed.
+
+The placement algorithm has seen a number of behavioral changes for
+unbalanceable rings. The ring builder wants to keep replicas as far
+apart as possible while still respecting device weights. In most
+cases, the ring builder can achieve both, but sometimes they conflict.
+At first, the behavior was to keep the replicas far apart and ignore
+device weight, but that made it impossible to gradually go from one
+region to two, or from two to three. Then it was changed to favor
+device weight over dispersion, but that wasn't so good for rings that
+were close to balanceable, like 3 machines with 60TB, 60TB, and 57TB
+of disk space; operators were expecting one replica per machine, but
+didn't always get it. After that, overload was added to the ring
+builder so that operators could choose a balance between dispersion
+and device weights.


@@ -251,6 +251,7 @@ swift-ring-builder <builder_file>
                                            balance)
         print 'The minimum number of hours before a partition can be ' \
             'reassigned is %s' % builder.min_part_hours
+        print 'The overload factor is %.6f' % builder.overload
         if builder.devs:
             print 'Devices: id region zone ip address port ' \
                   'replication ip replication port name ' \
@@ -650,7 +651,7 @@ swift-ring-builder <builder_file> rebalance <seed>
         print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
             (parts, 100.0 * parts / builder.parts, balance)
         status = EXIT_SUCCESS
-        if balance > 5:
+        if balance > 5 and balance / 100.0 > builder.overload:
             print '-' * 79
             print 'NOTE: Balance of %.02f indicates you should push this ' % \
                 balance
@@ -794,6 +795,35 @@ swift-ring-builder <builder_file> set_replicas <replicas>
         builder.save(argv[1])
         exit(EXIT_SUCCESS)
 
+    def set_overload():
+        """
+swift-ring-builder <builder_file> set_overload <overload>
+    Changes the overload factor to the given <overload>.
+
+    A rebalance is needed to make the change take effect.
+        """
+        if len(argv) < 4:
+            print Commands.set_overload.__doc__.strip()
+            exit(EXIT_ERROR)
+
+        new_overload = argv[3]
+        try:
+            new_overload = float(new_overload)
+        except ValueError:
+            print Commands.set_overload.__doc__.strip()
+            print "\"%s\" is not a valid number." % new_overload
+            exit(EXIT_ERROR)
+
+        if new_overload < 0:
+            print "Overload must be non-negative."
+            exit(EXIT_ERROR)
+
+        builder.set_overload(new_overload)
+        print 'The overload is now %.6f.' % builder.overload
+        print 'The change will take effect after the next rebalance.'
+        builder.save(argv[1])
+        exit(EXIT_SUCCESS)
+
 
 def main(arguments=None):
     global argv, backup_dir, builder, builder_file, ring_file


@@ -66,6 +66,7 @@ class RingBuilder(object):
         self.devs = []
         self.devs_changed = False
         self.version = 0
+        self.overload = 0.0
 
         # _replica2part2dev maps from replica number to partition number to
         # device id. So, for a three replica, 2**23 ring, it's an array of
@@ -122,6 +123,7 @@ class RingBuilder(object):
             self.parts = builder.parts
             self.devs = builder.devs
             self.devs_changed = builder.devs_changed
+            self.overload = builder.overload
             self.version = builder.version
             self._replica2part2dev = builder._replica2part2dev
             self._last_part_moves_epoch = builder._last_part_moves_epoch
@@ -135,6 +137,7 @@ class RingBuilder(object):
             self.parts = builder['parts']
             self.devs = builder['devs']
             self.devs_changed = builder['devs_changed']
+            self.overload = builder.get('overload', 0.0)
             self.version = builder['version']
             self._replica2part2dev = builder['_replica2part2dev']
             self._last_part_moves_epoch = builder['_last_part_moves_epoch']
@@ -162,6 +165,7 @@ class RingBuilder(object):
                 'devs': self.devs,
                 'devs_changed': self.devs_changed,
                 'version': self.version,
+                'overload': self.overload,
                 '_replica2part2dev': self._replica2part2dev,
                 '_last_part_moves_epoch': self._last_part_moves_epoch,
                 '_last_part_moves': self._last_part_moves,
@@ -202,6 +206,9 @@ class RingBuilder(object):
         self.replicas = new_replica_count
 
+    def set_overload(self, overload):
+        self.overload = overload
+
     def get_ring(self):
         """
         Get the ring, or more specifically, the swift.common.ring.RingData.
@@ -545,8 +552,8 @@ class RingBuilder(object):
                     # the last would not, probably resulting in a crash. This
                     # way, some devices end up with leftover parts_wanted, but
                     # at least every partition ends up somewhere.
-                    int(math.ceil(weight_of_one_part * dev['weight'])) -
-                    dev['parts'])
+                    int(math.ceil(weight_of_one_part * dev['weight']
+                                  - dev['parts'])))
 
     def _adjust_replica2part2dev_size(self):
         """
@@ -655,10 +662,12 @@ class RingBuilder(object):
         """
         wanted_parts_for_tier = {}
         for dev in self._iter_devs():
-            pw = max(0, dev['parts_wanted'])
+            pw = (max(0, dev['parts_wanted']) +
+                  max(int(math.ceil(
+                      (dev['parts_wanted'] + dev['parts']) * self.overload)),
+                      0))
             for tier in tiers_for_dev(dev):
-                if tier not in wanted_parts_for_tier:
-                    wanted_parts_for_tier[tier] = 0
+                wanted_parts_for_tier.setdefault(tier, 0)
                 wanted_parts_for_tier[tier] += pw
         return wanted_parts_for_tier
@@ -847,24 +856,30 @@ class RingBuilder(object):
                                replicas_to_replace may be shared for multiple
                                partitions, so be sure you do not modify it.
         """
+        fudge_available_in_tier = defaultdict(int)
         parts_available_in_tier = defaultdict(int)
         for dev in self._iter_devs():
             dev['sort_key'] = self._sort_key_for(dev)
             tiers = tiers_for_dev(dev)
             dev['tiers'] = tiers
+            # Note: this represents how many partitions may be assigned to a
+            # given tier (region/zone/server/disk). It does not take into
+            # account how many partitions a given tier wants to shed.
+            #
+            # If we did not do this, we could have a zone where, at some
+            # point during assignment, number-of-parts-to-gain equals
+            # number-of-parts-to-shed. At that point, no further placement
+            # into that zone would occur since its parts_available_in_tier
+            # would be 0. This would happen any time a zone had any device
+            # with partitions to shed, which is any time a device is being
+            # removed, which is a pretty frequent operation.
+            wanted = max(dev['parts_wanted'], 0)
+            fudge = max(int(math.ceil(
+                (dev['parts_wanted'] + dev['parts']) * self.overload)),
+                0)
             for tier in tiers:
-                # Note: this represents how many partitions may be assigned to
-                # a given tier (region/zone/server/disk). It does not take
-                # into account how many partitions a given tier wants to shed.
-                #
-                # If we did not do this, we could have a zone where, at some
-                # point during assignment, number-of-parts-to-gain equals
-                # number-of-parts-to-shed. At that point, no further placement
-                # into that zone would occur since its parts_available_in_tier
-                # would be 0. This would happen any time a zone had any device
-                # with partitions to shed, which is any time a device is being
-                # removed, which is a pretty frequent operation.
-                parts_available_in_tier[tier] += max(dev['parts_wanted'], 0)
+                fudge_available_in_tier[tier] += (wanted + fudge)
+                parts_available_in_tier[tier] += wanted
 
         available_devs = \
             sorted((d for d in self._iter_devs() if d['weight']),
@@ -916,6 +931,7 @@ class RingBuilder(object):
                 tier = ()
                 depth = 1
                 while depth <= max_tier_depth:
+                    roomiest_tier = fudgiest_tier = None
                     # Order the tiers by how many replicas of this
                     # partition they already have. Then, of the ones
                     # with the smallest number of replicas and that have
@@ -954,22 +970,43 @@ class RingBuilder(object):
                     candidates_with_room = [
                         t for t in tier2children[tier]
                         if parts_available_in_tier[t] > 0]
+                    candidates_with_fudge = set([
+                        t for t in tier2children[tier]
+                        if fudge_available_in_tier[t] > 0])
+                    candidates_with_fudge.update(candidates_with_room)
 
-                    if len(candidates_with_room) > \
-                            len(candidates_with_replicas):
-                        # There exists at least one tier with room for
-                        # another partition and 0 other replicas already
-                        # in it, so we can use a faster search. The else
-                        # branch's search would work here, but it's
-                        # significantly slower.
-                        tier = max((t for t in candidates_with_room
-                                    if other_replicas[t] == 0),
-                                   key=tier2sort_key.__getitem__)
+                    if candidates_with_room:
+                        if len(candidates_with_room) > \
+                                len(candidates_with_replicas):
+                            # There exists at least one tier with room for
+                            # another partition and 0 other replicas already in
+                            # it, so we can use a faster search. The else
+                            # branch's search would work here, but it's
+                            # significantly slower.
+                            roomiest_tier = max(
+                                (t for t in candidates_with_room
+                                 if other_replicas[t] == 0),
+                                key=tier2sort_key.__getitem__)
+                        else:
+                            roomiest_tier = max(
+                                candidates_with_room,
+                                key=lambda t: (-other_replicas[t],
+                                               tier2sort_key[t]))
                     else:
-                        tier = max(candidates_with_room,
-                                   key=lambda t: (-other_replicas[t],
-                                                  tier2sort_key[t]))
+                        roomiest_tier = None
+
+                    fudgiest_tier = max(candidates_with_fudge,
+                                        key=lambda t: (-other_replicas[t],
+                                                       tier2sort_key[t]))
+
+                    if (roomiest_tier is None or
+                            (other_replicas[roomiest_tier] >
+                             other_replicas[fudgiest_tier])):
+                        tier = fudgiest_tier
+                    else:
+                        tier = roomiest_tier
                     depth += 1
                 dev = tier2devs[tier][-1]
                 dev['parts_wanted'] -= 1
                 dev['parts'] += 1
@@ -977,6 +1014,7 @@ class RingBuilder(object):
                 new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
                 for tier in dev['tiers']:
                     parts_available_in_tier[tier] -= 1
+                    fudge_available_in_tier[tier] -= 1
                     other_replicas[tier] += 1
                     occupied_tiers_by_tier_len[len(tier)].add(tier)


@@ -15,6 +15,7 @@
 import mock
 import os
+import StringIO
 import tempfile
 import unittest
 import uuid
@@ -213,6 +214,27 @@ class TestCommands(unittest.TestCase):
         ring = RingBuilder.load(self.tmpfile)
         self.assertEqual(ring.replicas, 3.14159265359)
 
+    def test_set_overload(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "0.19878"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.19878)
+
+    def test_set_overload_negative(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "-0.19878"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.0)
+
+    def test_set_overload_non_numeric(self):
+        self.create_sample_ring()
+        argv = ["", self.tmpfile, "set_overload", "swedish fish"]
+        self.assertRaises(SystemExit, swift.cli.ringbuilder.main, argv)
+        ring = RingBuilder.load(self.tmpfile)
+        self.assertEqual(ring.overload, 0.0)
+
     def test_validate(self):
         self.create_sample_ring()
         ring = RingBuilder.load(self.tmpfile)
@@ -273,5 +295,81 @@ class TestCommands(unittest.TestCase):
         except SystemExit as e:
             self.assertEquals(e.code, 2)
 
+
+class TestRebalanceCommand(unittest.TestCase):
+
+    def __init__(self, *args, **kwargs):
+        super(TestRebalanceCommand, self).__init__(*args, **kwargs)
+        tmpf = tempfile.NamedTemporaryFile()
+        self.tempfile = tmpf.name
+
+    def tearDown(self):
+        try:
+            os.remove(self.tempfile)
+        except OSError:
+            pass
+
+    def run_srb(self, *argv):
+        mock_stdout = StringIO.StringIO()
+        mock_stderr = StringIO.StringIO()
+
+        srb_args = ["", self.tempfile] + [str(s) for s in argv]
+
+        try:
+            with mock.patch("sys.stdout", mock_stdout):
+                with mock.patch("sys.stderr", mock_stderr):
+                    swift.cli.ringbuilder.main(srb_args)
+        except SystemExit as err:
+            if err.code not in (0, 1):  # (success, warning)
+                raise
+        return (mock_stdout.getvalue(), mock_stderr.getvalue())
+
+    def test_rebalance_warning_appears(self):
+        self.run_srb("create", 8, 3, 24)
+        # all in one machine: totally balanceable
+        self.run_srb("add",
+                     "r1z1-10.1.1.1:2345/sda", 100.0,
+                     "r1z1-10.1.1.1:2345/sdb", 100.0,
+                     "r1z1-10.1.1.1:2345/sdc", 100.0,
+                     "r1z1-10.1.1.1:2345/sdd", 100.0)
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+        # 2 machines of equal size: balanceable, but not in one pass due to
+        # min_part_hours > 0
+        self.run_srb("add",
+                     "r1z1-10.1.1.2:2345/sda", 100.0,
+                     "r1z1-10.1.1.2:2345/sdb", 100.0,
+                     "r1z1-10.1.1.2:2345/sdc", 100.0,
+                     "r1z1-10.1.1.2:2345/sdd", 100.0)
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" in out)
+
+        # after two passes, it's all balanced out
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+    def test_rebalance_warning_with_overload(self):
+        self.run_srb("create", 8, 3, 24)
+        self.run_srb("set_overload", 0.12)
+        # The ring's balance is at least 5, so normally we'd get a warning,
+        # but it's suppressed due to the overload factor.
+        self.run_srb("add",
+                     "r1z1-10.1.1.1:2345/sda", 100.0,
+                     "r1z1-10.1.1.1:2345/sdb", 100.0,
+                     "r1z1-10.1.1.1:2345/sdc", 120.0)
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" not in out)
+
+        # Now we add in a really big device, but not enough partitions move
+        # to fill it in one pass, so we see the rebalance warning.
+        self.run_srb("add", "r1z1-10.1.1.1:2345/sdd", 99999.0)
+        self.run_srb("pretend_min_part_hours_passed")
+        out, err = self.run_srb("rebalance")
+        self.assertTrue("rebalance/repush" in out)
+
 
 if __name__ == '__main__':
     unittest.main()


@@ -38,6 +38,17 @@ class TestRingBuilder(unittest.TestCase):
     def tearDown(self):
         rmtree(self.testdir, ignore_errors=1)
 
+    def _partition_counts(self, builder):
+        """
+        Returns a dictionary mapping (device ID) to (number of partitions
+        assigned to that device).
+        """
+        counts = {}
+        for part2dev_id in builder._replica2part2dev:
+            for dev_id in part2dev_id:
+                counts[dev_id] = counts.get(dev_id, 0) + 1
+        return counts
+
     def _get_population_by_region(self, builder):
         """
         Returns a dictionary mapping region to number of partitions in that
@@ -984,6 +995,168 @@ class TestRingBuilder(unittest.TestCase):
                                         8: 192, 9: 192,
                                         10: 64, 11: 64})
 
+    def test_overload(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 1,
+                    'ip': '127.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 1, 'weight': 1,
+                    'ip': '127.0.0.1', 'port': 10001, 'device': 'sdb'})
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 2, 'weight': 2,
+                    'ip': '127.0.0.2', 'port': 10002, 'device': 'sdc'})
+        rb.rebalance(seed=12345)
+
+        # sanity check: balance respects weights, so default
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 192)
+        self.assertEqual(part_counts[1], 192)
+        self.assertEqual(part_counts[2], 384)
+
+        # Devices 0 and 1 take 10% more than their fair shares by weight since
+        # overload is 10% (0.1).
+        rb.set_overload(0.1)
+        for _ in range(2):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 212)
+        self.assertEqual(part_counts[1], 212)
+        self.assertEqual(part_counts[2], 344)
+
+        # Now, devices 0 and 1 take 50% more than their fair shares by
+        # weight.
+        rb.set_overload(0.5)
+        for _ in range(3):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 256)
+        self.assertEqual(part_counts[1], 256)
+        self.assertEqual(part_counts[2], 256)
+
+        # Devices 0 and 1 may take up to 75% over their fair share, but the
+        # placement algorithm only wants to spread things out evenly between
+        # all drives, so the devices stay at 50% more.
+        rb.set_overload(0.75)
+        for _ in range(3):
+            rb.pretend_min_part_hours_passed()
+            rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts[0], 256)
+        self.assertEqual(part_counts[1], 256)
+        self.assertEqual(part_counts[2], 256)
+
+    def test_overload_keeps_balanceable_things_balanced_initially(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 3, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 4, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 5, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 6, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 7, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 8, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 9, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sdb'})
+
+        rb.set_overload(99999)
+        rb.rebalance(seed=12345)
+
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
+    def test_overload_keeps_balanceable_things_balanced_on_rebalance(self):
+        rb = ring.RingBuilder(8, 3, 1)
+        rb.add_dev({'id': 0, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 1, 'region': 0, 'region': 0, 'zone': 0, 'weight': 8,
+                    'ip': '10.0.0.1', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 2, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 3, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.2', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 4, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 5, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.3', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 6, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 7, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.4', 'port': 10000, 'device': 'sdb'})
+
+        rb.add_dev({'id': 8, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sda'})
+        rb.add_dev({'id': 9, 'region': 0, 'region': 0, 'zone': 0, 'weight': 4,
+                    'ip': '10.0.0.5', 'port': 10000, 'device': 'sdb'})
+
+        rb.set_overload(99999)
+
+        rb.rebalance(seed=123)
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
+        # swap weights between 10.0.0.1 and 10.0.0.2
+        rb.set_dev_weight(0, 4)
+        rb.set_dev_weight(1, 4)
+        rb.set_dev_weight(2, 8)
+        rb.set_dev_weight(1, 8)
+
+        rb.rebalance(seed=456)
+        part_counts = self._partition_counts(rb)
+        self.assertEqual(part_counts, {
+            0: 128,
+            1: 128,
+            2: 64,
+            3: 64,
+            4: 64,
+            5: 64,
+            6: 64,
+            7: 64,
+            8: 64,
+            9: 64,
+        })
+
     def test_load(self):
         rb = ring.RingBuilder(8, 3, 1)
         devs = [{'id': 0, 'region': 0, 'zone': 0, 'weight': 1,
@@ -1099,6 +1272,7 @@ class TestRingBuilder(unittest.TestCase):
                  'ip': '127.0.0.3', 'port': 10003,
                  'replication_ip': '127.0.0.3', 'replication_port': 10003,
                  'device': 'sdd1', 'meta': ''}]
+        rb.set_overload(3.14159)
         for d in devs:
             rb.add_dev(d)
         rb.rebalance()
@@ -1107,6 +1281,7 @@ class TestRingBuilder(unittest.TestCase):
         loaded_rb = ring.RingBuilder.load(builder_file)
         self.maxDiff = None
         self.assertEquals(loaded_rb.to_dict(), rb.to_dict())
+        self.assertEquals(loaded_rb.overload, 3.14159)
 
     @mock.patch('__builtin__.open', autospec=True)
     @mock.patch('swift.common.ring.builder.pickle.dump', autospec=True)