404ac092d1
As I understand it db replication starts with a preflight sync request to the remote container server who's response will include the last synced row_id that it has on file for the sending nodes database id. If the difference in the last sync point returned is more than 50% of the local sending db's rows, it'll fall back to sending the whole db over rsync and let the remote end merge items locally - but generally there's just a few rows missing and they're shipped over the wire as json and stuffed into some rather normal looking merge_items calls. The one thing that's a bit different with these remote merge_items calls (compared to your average run of the mill eat a bunch of entries out of a .pending file) is the is source kwarg. When this optional kwarg comes into merge_items it's the remote sending db's uuid, and after we eat all the rows it sent us we update our local incoming_sync table for that uuid so that next time when it makes it's pre-flight sync request we can tell it where it left off. Now normally the sending db is going to push out it's rows up from the returned sync_point in 1000 item diffs, up to 10 batches total (per_diff and max_diffs options) - 10K rows. If that goes well then everything is in sync up to at least the point it started, and the sending db will *also* ship over *it's* incoming_sync rows to merge_syncs on the remote end. Since the sending db is in sync with these other db's up to those points so is the remote db now by way of the transitive property. Also note through some weird artifact that I'm not entirely convinced isn't an unrelated and possibly benign bug the incoming_sync table on the sending db will often also happen to include it's own uuid - maybe it got pushed back to it from another node? Anyway, that seemed to work well enough until a sending db got diff capped (i.e. sent it's 10K rows and wasn't finished), when this happened the final merge_syncs call never gets sent because the remote end is definitely *not* up to date with the other databases that the sending db is - it's not even up-to-date with the sending db yet! But the hope is certainly that on the next pass it'll be able to finish sending the remaining items. But since the remote end is who decides what the last successfully synced row with this local sending db was - it's super important that the incoming_sync table is getting updated in merge_items when that source kwarg is there. I observed this simple and straight forward process wasn't working well in one case - which is weird considering it didn't have much in the way of tests. After I had the test and started looking into it seemed maybe the source kwarg handling got over-indented a bit in the bulk insert merge_items refactor. I think this is correct - maybe we could send someone up to the mountain temple to seek out gholt? Change-Id: I4137388a97925814748ecc36b3ab5f1ac3309659