swift/doc/source/overview_replication.rst
gholt a80c720af5 Object replication ssync (an rsync alternative)
For this commit, ssync is just a direct replacement for how
we use rsync. Assuming we switch over to ssync completely
someday and drop rsync, we will then be able to improve the
algorithms even further (removing local objects as we
successfully transfer each one rather than waiting for whole
partitions, using an index.db with hash-trees, etc., etc.)

For easier review, this commit can be thought of in distinct
parts:

1)  New global_conf_callback functionality for allowing
    services to perform setup code before workers, etc. are
    launched. (This is then used by ssync in the object
    server to create a cross-worker semaphore to restrict
    concurrent incoming replication.)

2)  A bit of shifting of items up from object server and
    replicator to diskfile or DEFAULT conf sections for
    better sharing of the same settings. conn_timeout,
    node_timeout, client_timeout, network_chunk_size,
    disk_chunk_size.

3)  Modifications to the object server and replicator to
    optionally use ssync in place of rsync. This is done in
    a generic enough way that switching to FutureSync should
    be easy someday.

4)  The biggest part, and (at least for now) completely
    optional part, are the new ssync_sender and
    ssync_receiver files. Nice and isolated for easier
    testing and visibility into test coverage, etc.

All the usual logging, statsd, recon, etc. instrumentation
is still there when using ssync, just as it is when using
rsync.

Beyond the essential error and exceptional condition
logging, I have not added any additional instrumentation at
this time. Unless there is something someone finds super
pressing to have added to the logging, I think such
additions would be better as separate change reviews.

FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION
CLUSTERS. Some of us will be in a limited fashion to look
for any subtle issues, tuning, etc. but generally ssync is
an experimental feature. In its current implementation it is
probably going to be a bit slower than rsync, but if all
goes according to plan it will end up much faster.

There are no comparisions yet between ssync and rsync other
than some raw virtual machine testing I've done to show it
should compete well enough once we can put it in use in the
real world.

If you Tweet, Google+, or whatever, be sure to indicate it's
experimental. It'd be best to keep it out of deployment
guides, howtos, etc. until we all figure out if we like it,
find it to be stable, etc.

Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-11-07 16:52:01 +00:00

126 lines
6.3 KiB
ReStructuredText

===========
Replication
===========
Because each replica in swift functions independently, and clients generally
require only a simple majority of nodes responding to consider an operation
successful, transient failures like network partitions can quickly cause
replicas to diverge. These differences are eventually reconciled by
asynchronous, peer-to-peer replicator processes. The replicator processes
traverse their local filesystems, concurrently performing operations in a
manner that balances load across physical disks.
Replication uses a push model, with records and files generally only being
copied from local to remote replicas. This is important because data on the
node may not belong there (as in the case of handoffs and ring changes), and a
replicator can't know what data exists elsewhere in the cluster that it should
pull in. It's the duty of any node that contains data to ensure that data gets
to where it belongs. Replica placement is handled by the ring.
Every deleted record or file in the system is marked by a tombstone, so that
deletions can be replicated alongside creations. The replication process cleans
up tombstones after a time period known as the consistency window.
The consistency window encompasses replication duration and how long transient
failure can remove a node from the cluster. Tombstone cleanup must
be tied to replication to reach replica convergence.
If a replicator detects that a remote drive has failed, the replicator uses
the get_more_nodes interface for the ring to choose an alternate node with
which to synchronize. The replicator can maintain desired levels of replication
in the face of disk failures, though some replicas may not be in an immediately
usable location. Note that the replicator doesn't maintain desired levels of
replication when other failures, such as entire node failures, occur because
most failure are transient.
Replication is an area of active development, and likely rife with potential
improvements to speed and correctness.
There are two major classes of replicator - the db replicator, which
replicates accounts and containers, and the object replicator, which
replicates object data.
--------------
DB Replication
--------------
The first step performed by db replication is a low-cost hash comparison to
determine whether two replicas already match. Under normal operation,
this check is able to verify that most databases in the system are already
synchronized very quickly. If the hashes differ, the replicator brings the
databases in sync by sharing records added since the last sync point.
This sync point is a high water mark noting the last record at which two
databases were known to be in sync, and is stored in each database as a tuple
of the remote database id and record id. Database ids are unique amongst all
replicas of the database, and record ids are monotonically increasing
integers. After all new records have been pushed to the remote database, the
entire sync table of the local database is pushed, so the remote database
can guarantee that it is in sync with everything with which the local database
has previously synchronized.
If a replica is found to be missing entirely, the whole local database file is
transmitted to the peer using rsync(1) and vested with a new unique id.
In practice, DB replication can process hundreds of databases per concurrency
setting per second (up to the number of available CPUs or disks) and is bound
by the number of DB transactions that must be performed.
------------------
Object Replication
------------------
The initial implementation of object replication simply performed an rsync to
push data from a local partition to all remote servers it was expected to
exist on. While this performed adequately at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM. We now
use a modification of this scheme in which a hash of the contents for each
suffix directory is saved to a per-partition hashes file. The hash for a
suffix directory is invalidated when the contents of that suffix directory are
modified.
The object replication process reads in these hash files, calculating any
invalidated hashes. It then transmits the hashes to each remote server that
should hold the partition, and only suffix directories with differing hashes
on the remote server are rsynced. After pushing files to the remote server,
the replication process notifies it to recalculate hashes for the rsynced
suffix directories.
Performance of object replication is generally bound by the number of uncached
directories it has to traverse, usually as a result of invalidated suffix
directory hashes. Using write volume and partition counts from our running
systems, it was designed so that around 2% of the hash space on a normal node
will be invalidated per day, which has experimentally given us acceptable
replication speeds.
Work continues with a new ssync method where rsync is not used at all and
instead all-Swift code is used to transfer the objects. At first, this ssync
will just strive to emulate the rsync behavior. Once deemed stable it will open
the way for future improvements in replication since we'll be able to easily
add code in the replication path instead of trying to alter the rsync code
base and distributing such modifications.
One of the first improvements planned is an "index.db" that will replace the
hashes.pkl. This will allow quicker updates to that data as well as more
streamlined queries. Quite likely we'll implement a better scheme than the
current one hashes.pkl uses (hash-trees, that sort of thing).
Another improvement planned all along the way is separating the local disk
structure from the protocol path structure. This separation will allow ring
resizing at some point, or at least ring-doubling.
FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us
will be in a limited fashion to look for any subtle issues, tuning, etc. but
generally ssync is an experimental feature. In its current implementation it is
probably going to be a bit slower than RSync, but if all goes according to plan
it will end up much faster.
-----------------------------
Dedicated replication network
-----------------------------
Swift has support for using dedicated network for replication traffic.
For more information see :ref:`Overview of dedicated replication network
<Dedicated-replication-network>`.