It's harder than it sounds. There were really three challenges.
Challenge #1 Initial Assignment
===============================
Before starting to assign parts on this shiny new ring you've
constructed, it pays to pause for a moment up front and consider the
lay of the land. The result of this process is called the replica_plan.
The replica_plan approach separates part-assignment failures into two
modes:
1) we considered the cluster topology and its weights and came up with
the wrong plan
2) we failed to execute on the plan
I failed at both parts plenty of times before I got it this close. I'm
sure a counterexample still exists, but when we find it, the new helper
methods will let us reason about where things went wrong.
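To make that split concrete, here's a minimal sketch of the two failure
modes. The names (build_replica_plan, plan_is_sane, off_plan_tiers) and
the tier -> replicas-per-part shape are invented for illustration; they
are not the real builder API.

    from collections import defaultdict

    def build_replica_plan(tier_weights, replica_count):
        # Hypothetical: split replica_count across tiers in proportion
        # to their weight.
        total = sum(tier_weights.values())
        return {tier: replica_count * weight / total
                for tier, weight in tier_weights.items()}

    def plan_is_sane(replica_plan, replica_count, tolerance=1e-9):
        # Failure mode #1: the plan itself doesn't add up.
        return abs(sum(replica_plan.values()) - replica_count) < tolerance

    def off_plan_tiers(replica_plan, placed):
        # Failure mode #2: the plan was fine, but the assignment that
        # came out of rebalance doesn't follow it.  `placed` maps
        # tier -> how many replicas of a part actually landed there.
        counts = defaultdict(int, placed)
        return {tier: (counts[tier], wanted)
                for tier, wanted in replica_plan.items()
                if round(counts[tier]) != round(wanted)}

    plan = build_replica_plan({('r1',): 100.0, ('r2',): 200.0}, 3)
    assert plan_is_sane(plan, 3)            # rules out failure mode #1
    print(off_plan_tiers(plan, {('r1',): 2, ('r2',): 1}))  # mode #2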
Challenge #2 Fixing Placement
=============================
With a sound plan in hand, the less material you have to execute with,
the easier it is to fail to execute on the plan - so we gather up as
many parts as we can, as long as we think we can find them a better
home.
Picking the right parts to gather is a black art - when you notice a
rebalance is slow, it's because it's spending so much time iterating
over replica2part2dev trying to decide just the right parts to gather.
The replica plan can help at least with the gross dispersion
collection, gathering up the worst offenders first before considering
balance. I think trying to avoid picking up parts that are stuck to
their tier before falling into a forced grab on anything over
parts_wanted helps with stability generally - but depending on where
the parts_wanted are in relation to the full devices, it's pretty easy
to pick up something that'll end up really close to where it started.
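Roughly, the ordering I'm describing looks like the sketch below;
gather_parts and its arguments are simplified stand-ins, not the real
gather methods.

    from collections import Counter

    def gather_parts(assignments, tier_limit, parts_wanted):
        # assignments: list of (part, dev, tier), one per part-replica.
        # tier_limit: tier -> max replicas of any one part the plan
        #             allows in that tier.
        # parts_wanted: dev -> parts the device still wants (negative
        #               means it holds more than its weight deserves).
        gathered = []
        taken = set()

        # Pass 1: dispersion - grab replicas from tiers holding more
        # copies of a part than the plan says they should.
        per_part_tier = Counter((part, tier)
                                for part, _, tier in assignments)
        for part, dev, tier in assignments:
            if per_part_tier[(part, tier)] > tier_limit.get(tier, 1):
                gathered.append((part, dev))
                taken.add((part, dev))
                per_part_tier[(part, tier)] -= 1

        # Pass 2: balance - only then take parts from devices that are
        # over their parts_wanted.
        for part, dev, _ in assignments:
            if (part, dev) not in taken and parts_wanted.get(dev, 0) < 0:
                gathered.append((part, dev))
                parts_wanted[dev] += 1
        return gathered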
I tried to break the gather methods into smaller pieces so it looked
like I knew what I was doing.
Going with a MAXIMUM gather iteration instead of balance (which doesn't
reflect the replica_plan) doesn't seem to be costing me anything - most
of the time the exit condition is either that the plan is solved or
that all the remaining parts are overly aggressively locked up on
min_part_hours. So far it mostly seems that if the thing is going to
balance this round, it'll get it in the first couple of shakes.
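Sketched out, the loop shape is roughly the following; the constant and
the state methods are invented for illustration.

    MAX_GATHER_ITERATIONS = 4   # made-up cap, not the builder's real one

    def rebalance(state):
        # state is assumed to expose off_plan_parts(), movable(part)
        # (which honors min_part_hours), and reassign(parts).
        for shake in range(MAX_GATHER_ITERATIONS):
            off_plan = state.off_plan_parts()
            if not off_plan:
                return 'solved', shake   # the replica_plan is satisfied
            movable = [p for p in off_plan if state.movable(p)]
            if not movable:
                return 'locked', shake   # all stuck on min_part_hours
            state.reassign(movable)
        return 'gave up', MAX_GATHER_ITERATIONS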
Challenge #3 Crazy replica2part2dev tables
==========================================
I think there are lots of ways "scars" can build up in a ring, which
can result in very particular replica2part2dev tables that are
physically difficult to dig out of. It's repairing these scars that
will take multiple rebalances to resolve.
... but at this point ...
... lacking a counter example ...
I've been able to close up all the edge cases I was able to find. It
may not be quick, but progress will be made.
Basically my strategy just required a better understanding of how
previous algorithms were able to *mostly* keep things moving by
brute-forcing the whole mess with a bunch of randomness. Then, when we
detect that our "elegant" careful part selection isn't making progress,
we can fall back to the same old tricks.
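Something along these lines, where careful_pick stands in for the
plan-driven selection and the random grab is the same old brute force:

    import random

    def pick_parts(candidates, careful_pick, moved_last_round):
        # Prefer the plan-driven selection; if the previous round moved
        # nothing, shake things loose the old way and pick at random.
        if moved_last_round == 0 and candidates:
            return random.sample(candidates, min(len(candidates), 16))
        return careful_pick(candidates)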
Validation
==========
We validate against duplicate part replica assignment after rebalance
and raise an ERROR if we detect more than one replica of a part assigned
to the same device.
In order to meet that requirement we have to have as many devices as
replicas, so attempting to rebalance with too few devices without
changing your replica_count is also an ERROR, not a warning.
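In sketch form - validate and its arguments are simplified here, and I
assume the usual replica -> part -> device id indexing for the table:

    def validate(replica2part2dev, replica_count, device_count):
        if device_count < replica_count:
            raise ValueError('%d devices is not enough for %d replicas'
                             % (device_count, replica_count))
        for part in range(len(replica2part2dev[0])):
            devs = [row[part] for row in replica2part2dev]
            if len(devs) != len(set(devs)):
                raise ValueError('part %d has multiple replicas on a '
                                 'single device: %r' % (part, devs))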
Random Thoughts
===============
As usual with rings, the test diff can be hard to reason about -
hopefully I've added enough comments to assure future me that these
assertions make sense.
Although this is a large rewrite of a lot of important code, the
existing code is known to have failed us. This change fixes a critical
bug that's trivial to reproduce in a critical component of the system.
There's probably a bunch of error messages and exit status stuff that's
not as helpful as it could be considering the new behaviors.
Change-Id: I1bbe7be38806fc1c8b9181a722933c18a6c76e05
Closes-Bug: #1452431