swift/doc/source/overview_container_sharding.rst
Matthew Oliver 7f5e34a405 sharidng: update doc to only mention auto_shard experimental
There are been members of the community running sharding in production
and it's awesome. It's just the auto-sharding swift of that remains
experimental.

This patch removes the big sharding warning from the top of the
sharding overview page and better emphasises that it's the audo_shard
option that isn't ready for production use.

Change-Id: Id2c842cffad58fb6fd5e1d12619c46ffcb38f8a5
2021-10-19 12:53:04 +11:00

32 KiB

Container Sharding

Container sharding is an operator controlled feature that may be used to shard very large container databases into a number of smaller shard containers

Note

It is strongly recommended that operators gain experience of sharding containers in a non-production cluster before using in production.

The sharding process involves moving all sharding container database records via the container replication engine; the time taken to complete sharding is dependent upon the existing cluster load and the performance of the container database being sharded.

There is currently no documented process for reversing the sharding process once sharding has been enabled.

Background

The metadata for each container in Swift is stored in an SQLite database. This metadata includes: information about the container such as its name, modification time and current object count; user metadata that may been written to the container by clients; a record of every object in the container. The container database object records are used to generate container listings in response to container GET requests; each object record stores the object's name, size, hash and content-type as well as associated timestamps.

As the number of objects in a container increases then the number of object records in the container database increases. Eventually the container database performance starts to degrade and the time taken to update an object record increases. This can result in object updates timing out, with a corresponding increase in the backlog of pending asynchronous updates <architecture_updaters> on object servers. Container databases are typically replicated on several nodes and any database performance degradation can also result in longer container replication <overview_replication> times.

The point at which container database performance starts to degrade depends upon the choice of hardware in the container ring. Anecdotal evidence suggests that containers with tens of millions of object records have noticeably degraded performance.

This performance degradation can be avoided by ensuring that clients use an object naming scheme that disperses objects across a number of containers thereby distributing load across a number of container databases. However, that is not always desirable nor is it under the control of the cluster operator.

Swift's container sharding feature provides the operator with a mechanism to distribute the load on a single client-visible container across multiple, hidden, shard containers, each of which stores a subset of the container's object records. Clients are unaware of container sharding; clients continue to use the same API to access a container that, if sharded, maps to a number of shard containers within the Swift cluster.

Deployment and operation

Upgrade Considerations

It is essential that all servers in a Swift cluster have been upgraded to support the container sharding feature before attempting to shard a container.

Identifying containers in need of sharding

Container sharding is currently initiated by the swift-manage-shard-ranges CLI tool described below <swift-manage-shard-ranges>. Operators must first identify containers that are candidates for sharding. To assist with this, the sharder_daemon inspects the size of containers that it visits and writes a list of sharding candidates to recon cache. For example:

"sharding_candidates": {
    "found": 1,
    "top": [
        {
            "account": "AUTH_test",
            "container": "c1",
            "file_size": 497763328,
            "meta_timestamp": "1525346445.31161",
            "node_index": 2,
            "object_count": 3349028,
            "path": <path_to_db>,
            "root": "AUTH_test/c1"
        }
    ]
}

A container is considered to be a sharding candidate if its object count is greater than or equal to the shard_container_threshold option. The number of candidates reported is limited to a number configured by the recon_candidates_limit option such that only the largest candidate containers are included in the sharding_candidates data.

swift-manage-shard-ranges CLI tool

swift.cli.manage_shard_ranges

container-sharder daemon

Once sharding has been enabled for a container, the act of sharding is performed by the container-sharder. The container-sharder daemon must be running on all container servers. The container-sharder daemon periodically visits each container database to perform any container sharding tasks that are required.

The container-sharder daemon requires a [container-sharder] config section to exist in the container server configuration file; a sample config section is shown in the container-server.conf-sample file.

Note

The auto_shard option is currently NOT recommended for production systems and should be set to false (the default value).

Several of the [container-sharder] config options are only significant when the auto_shard option is enabled. This option enables the container-sharder daemon to automatically identify containers that are candidates for sharding and initiate the sharding process, instead of using the swift-manage-shard-ranges tool.

The container sharder uses an internal client and therefore requires an internal client configuration file to exist. By default the internal-client configuration file is expected to be found at /etc/swift/internal-client.conf. An alternative location for the configuration file may be specified using the internal_client_conf_path option in the [container-sharder] config section.

The content of the internal-client configuration file should be the same as the internal-client.conf-sample file. In particular, the internal-client configuration should have:

account_autocreate = True

in the [proxy-server] section.

A container database may require several visits by the container-sharder daemon before it is fully sharded. On each visit the container-sharder daemon will move a subset of object records to new shard containers by cleaving new shard container databases from the original. By default, two shards are processed per visit; this number may be configured by the cleave_batch_size option.

The container-sharder daemon periodically writes progress data for containers that are being sharded to recon cache. For example:

"sharding_in_progress": {
    "all": [
        {
            "account": "AUTH_test",
            "active": 0,
            "cleaved": 2,
            "container": "c1",
            "created": 5,
            "db_state": "sharding",
            "error": null,
            "file_size": 26624,
            "found": 0,
            "meta_timestamp": "1525349617.46235",
            "node_index": 1,
            "object_count": 3349030,
            "path": <path_to_db>,
            "root": "AUTH_test/c1",
            "state": "sharding"
        }
    ]
}

This example indicates that from a total of 7 shard ranges, 2 have been cleaved whereas 5 remain in created state waiting to be cleaved.

Shard containers are created in an internal account and not visible to clients. By default, shard containers for an account AUTH_test are created in the internal account .shards_AUTH_test.

Once a container has started sharding, object updates to that container may be redirected to the shard container. The container-sharder daemon is also responsible for sending updates of a shard's object count and bytes_used to the original container so that aggegrate object count and bytes used values can be returned in responses to client requests.

Note

The container-sharder daemon must continue to run on all container servers in order for shards object stats updates to be generated.

Under the hood

Terminology

Name Description

Root container

The original container that lives in the user's account. It holds references to its shard containers.

Retiring DB The original database file that is to be sharded.

Fresh DB

A database file that will replace the retiring database.

Epoch

A timestamp at which the fresh DB is created; the epoch value is embedded in the fresh DB filename.

Shard range

A range of the object namespace defined by a lower bound and upper bound.

Shard container

A container that holds object records for a shard range. Shard containers exist in a hidden account mirroring the user's account.

Parent container

The container from which a shard container has been cleaved. When first sharding a root container each shard's parent container will be the root container. When sharding a shard container each shard's parent container will be the sharding shard container.

Misplaced objects

Items that don't belong in a container's shard range. These will be moved to their correct location by the container-sharder.

Cleaving

The act of moving object records within a shard range to a shard container database.

Shrinking

The act of merging a small shard container into another shard container in order to delete the small shard container.

Donor The shard range that is shrinking away.
Acceptor The shard range into which a donor is merged.

Finding shard ranges

The end goal of sharding a container is to replace the original container database which has grown very large with a number of shard container databases, each of which is responsible for storing a range of the entire object namespace. The first step towards achieving this is to identify an appropriate set of contiguous object namespaces, known as shard ranges, each of which contains a similar sized portion of the container's current object content.

Shard ranges cannot simply be selected by sharding the namespace uniformly, because object names are not guaranteed to be distributed uniformly. If the container were naively sharded into two shard ranges, one containing all object names up to m and the other containing all object names beyond m, then if all object names actually start with o the outcome would be an extremely unbalanced pair of shard containers.

It is also too simplistic to assume that every container that requires sharding can be sharded into two. This might be the goal in the ideal world, but in practice there will be containers that have grown very large and should be sharded into many shards. Furthermore, the time required to find the exact mid-point of the existing object names in a large SQLite database would increase with container size.

For these reasons, shard ranges of size N are found by searching for the Nth object in the database table, sorted by object name, and then searching for the (2 * N)th object, and so on until all objects have been searched. For a container that has exactly 2N objects, the end result is the same as sharding the container at the midpoint of its object names. In practice sharding would typically be enabled for containers with great than 2N objects and more than two shard ranges will be found, the last one probably containing less than N objects. With containers having large multiples of N objects, shard ranges can be identified in batches which enables more scalable solution.

To illustrate this process, consider a very large container in a user account acct that is a candidate for sharding:

image

The swift-manage-shard-ranges tool find sub-command searches the object table for the Nth object whose name will become the upper bound of the first shard range, and the lower bound of the second shard range. The lower bound of the first shard range is the empty string.

For the purposes of this example the first upper bound is `cat`:

image

swift-manage-shard-ranges continues to search the container to find further shard ranges, with the final upper bound also being the empty string.

Enabling sharding

Once shard ranges have been found the swift-manage-shard-ranges replace sub-command is used to insert them into the shard_ranges table of the container database. In addition to its lower and upper bounds, each shard range is given a unique name.

The enable sub-command then creates some final state required to initiate sharding the container, including a special shard range record referred to as the container's own_shard_range whose name is equal to the container's path. This is used to keep a record of the object namespace that the container covers, which for user containers is always the entire namespace. Sharding of the container will only begin when its own shard range's state has been set to SHARDING.

The ~swift.common.utils.ShardRange class

The ~swift.common.utils.ShardRange class provides methods for interactng with the attributes and state of a shard range. The class encapsulates the following properties:

  • The name of the shard range which is also the name of the shard container used to hold object records in its namespace.
  • Lower and upper bounds which define the object namespace of the shard range.
  • A deleted flag.
  • A timestamp at which the bounds and deleted flag were last modified.
  • The object stats for the shard range i.e. object count and bytes used.
  • A timestamp at which the object stats were last modified.
  • The state of the shard range, and an epoch, which is the timestamp used in the shard container's database file name.
  • A timestamp at which the state and epoch were last modified.

A shard range progresses through the following states:

  • FOUND: the shard range has been identified in the container that is to be sharded but no resources have been created for it.
  • CREATED: a shard container has been created to store the contents of the shard range.
  • CLEAVED: the sharding container's contents for the shard range have been copied to the shard container from at least one replica of the sharding container.
  • ACTIVE: a sharding container's constituent shard ranges are moved to this state when all shard ranges in the sharding container have been cleaved.
  • SHRINKING: the shard range has been enabled for shrinking; or
  • SHARDING: the shard range has been enabled for sharding into further sub-shards.
  • SHARDED: the shard range has completed sharding or shrinking; the container will typically now have a number of constituent ACTIVE shard ranges.

Note

Shard range state represents the most advanced state of the shard range on any replica of the container. For example, a shard range in CLEAVED state may not have completed cleaving on all replicas but has cleaved on at least one replica.

Fresh and retiring database files

As alluded to earlier, writing to a large container causes increased latency for the container servers. Once sharding has been initiated on a container it is desirable to stop writing to the large database; ultimately it will be unlinked. This is primarily achieved by redirecting object updates to new shard containers as they are created (see redirecting_updates below), but some object updates may still need to be accepted by the root container and other container metadata must still be modifiable.

To render the large retiring database effectively read-only, when the sharder_daemon finds a container with a set of shard range records, including an own_shard_range, it first creates a fresh database file which will ultimately replace the existing retiring database. For a retiring DB whose filename is:

<hash>.db

the fresh database file name is of the form:

<hash>_<epoch>.db

where epoch is a timestamp stored in the container's own_shard_range.

The fresh DB has a copy of the shard ranges table from the retiring DB and all other container metadata apart from the object records. Once a fresh DB file has been created it is used to store any new object updates and no more object records are written to the retiring DB file.

Once the sharding process has completed, the retiring DB file will be unlinked leaving only the fresh DB file in the container's directory. There are therefore three states that the container DB directory may be in during the sharding process: UNSHARDED, SHARDING and SHARDED.

image

If the container ever shrink to the point that is has no shards then the fresh DB starts to store object records, behaving the same as an unsharded container. This is known as the COLLAPSED state.

In summary, the DB states that any container replica may be in are:

  • UNSHARDED - In this state there is just one standard container database. All containers are originally in this state.
  • SHARDING - There are now two databases, the retiring database and a fresh database. The fresh database stores any metadata, container level stats, an object holding table, and a table that stores shard ranges.
  • SHARDED - There is only one database, the fresh database, which has one or more shard ranges in addition to its own shard range. The retiring database has been unlinked.
  • COLLAPSED - There is only one database, the fresh database, which has only its own shard range and store object records.

Note

DB state is unique to each replica of a container and is not necessarily synchronised with shard range state.

Creating shard containers

The sharder_daemon next creates a shard container for each shard range using the shard range name as the name of the shard container:

image

Each shard container has an own_shard_range record which has the lower and upper bounds of the object namespace for which it is responsible, and a reference to the sharding user container, which is referred to as the root_container. Unlike the root_container, the shard container's own_shard_range does not cover the entire namepsace.

A shard range name takes the form <shard_a>/<shard_c> where <shard_a> is a hidden account and <shard_c> is a container name that is derived from the root container.

The account name <shard_a> used for shard containers is formed by prefixing the user account with the string .shards_. This avoids namespace collisions and also keeps all the shard containers out of view from users of the account.

The container name for each shard container has the form:

<root container name>-<hash of parent container>-<timestamp>-<shard index>

where root container name is the name of the user container to which the contents of the shard container belong, parent container is the name of the container from which the shard is being cleaved, timestamp is the time at which the shard range was created and shard index is the position of the shard range in the name-ordered list of shard ranges for the parent container.

When sharding a user container the parent container name will be the same as the root container. However, if a shard container grows to a size that it requires sharding, then the parent container name for its shards will be the name of the sharding shard container.

For example, consider a user container with path AUTH_user/c which is sharded into two shard containers whose name will be:

.shards_AUTH_user/c-<hash(c)>-1234512345.12345-0
.shards_AUTH_user/c-<hash(c)>-1234512345.12345-1

If the first shard container is subsequently sharded into a further two shard containers then they will be named:

.shards_AUTH_user/c-<hash(c-<hash(c)>-1234567890.12345-0)>-1234567890.12345-0
.shards_AUTH_user/c-<hash(c-<hash(c)>-1234567890.12345-0)>-1234567890.12345-1

This naming scheme guarantees that shards, and shards of shards, each have a unique name of bounded length.

Cleaving shard containers

Having created empty shard containers the sharder daemon will proceed to cleave objects from the retiring database to each shard range. Cleaving occurs in batches of two (by default) shard ranges, so if a container has more than two shard ranges then the daemon must visit it multiple times to complete cleaving.

To cleave a shard range the daemon creates a shard database for the shard container on a local device. This device may be one of the shard container's primary nodes but often it will not. Object records from the corresponding shard range namespace are then copied from the retiring DB to this shard DB.

Swift's container replication mechanism is then used to replicate the shard DB to its primary nodes. Checks are made to ensure that the new shard container DB has been replicated to a sufficient number of its primary nodes before it is considered to have been successfully cleaved. By default the daemon requires successful replication of a new shard broker to at least a quorum of the container rings replica count, but this requirement can be tuned using the shard_replication_quorum option.

Once a shard range has been successfully cleaved from a retiring database the daemon transitions its state to CLEAVED. It should be noted that this state transition occurs as soon as any one of the retiring DB replicas has cleaved the shard range, and therefore does not imply that all retiring DB replicas have cleaved that range. The significance of the state transition is that the shard container is now considered suitable for contributing to object listings, since its contents are present on a quorum of its primary nodes and are the same as at least one of the retiring DBs for that namespace.

Once a shard range is in the CLEAVED state, the requirement for 'successful' cleaving of other instances of the retirng DB may optionally be relaxed since it is not so imperative that their contents are replicated immediately to their primary nodes. The existing_shard_replication_quorum option can be used to reduce the quorum required for a cleaved shard range to be considered successfully replicated by the sharder daemon.

Note

Once cleaved, shard container DBs will continue to be replicated by the normal container-replicator daemon so that they will eventually be fully replicated to all primary nodes regardless of any replication quorum options used by the sharder daemon.

The cleaving progress of each replica of a retiring DB must be tracked independently of the shard range state. This is done using a per-DB CleavingContext object that maintains a cleaving cursor for the retiring DB that it is associated with. The cleaving cursor is simply the upper bound of the last shard range to have been cleaved from that particular retiring DB.

Each CleavingContext is stored in the sharding container's sysmeta under a key that is the id of the retiring DB. Since all container DB files have a unique id, this guarantees that each retiring DB will have a unique CleavingContext. Furthermore, if the retiring DB file is changed, for example by an rsync_then_merge replication operation which might change the contents of the DB's object table, then it will get a new unique CleavingContext.

A CleavingContext maintains other state that is used to ensure that a retiring DB is only considered to be fully cleaved, and ready to be deleted, if all of its object rows have been cleaved to a shard range.

Once all shard ranges have been cleaved from the retiring DB it is deleted. The container is now represented by the fresh DB which has a table of shard range records that point to the shard containers that store the container's object records.

Redirecting object updates

Once a shard container exists, object updates arising from new client requests and async pending files are directed to the shard container instead of the root container. This takes load off of the root container.

For a sharded (or partially sharded) container, when the proxy receives a new object request it issues a GET request to the container for data describing a shard container to which the object update should be sent. The proxy then annotates the object request with the shard container location so that the object server will forward object updates to the shard container. If those updates fail then the async pending file that is written on the object server contains the shard container location.

When the object updater processes async pending files for previously failed object updates, it may not find a shard container location. In this case the updater sends the update to the root container, which returns a redirection response with the shard container location.

Note

Object updates are directed to shard containers as soon as they exist, even if the retiring DB object records have not yet been cleaved to the shard container. This prevents further writes to the retiring DB and also avoids the fresh DB being polluted by new object updates. The goal is to ultimately have all object records in the shard containers and none in the root container.

Building container listings

Listing requests for a sharded container are handled by querying the shard containers for components of the listing. The proxy forwards the client listing request to the root container, as it would for an unsharded container, but the container server responds with a list of shard ranges rather than objects. The proxy then queries each shard container in namespace order for their listing, until either the listing length limit is reached or all shard ranges have been listed.

While a container is still in the process of sharding, only cleaved shard ranges are used when building a container listing. Shard ranges that have not yet cleaved will not have any object records from the root container. The root container continues to provide listings for the uncleaved part of its namespace.

Note

New object updates are redirected to shard containers that have not yet been cleaved. These updates will not therefore be included in container listings until their shard range has been cleaved.

Example request redirection

As an example, consider a sharding container in which 3 shard ranges have been found ending in cat, giraffe and igloo. Their respective shard containers have been created so update requests for objects up to "igloo" are redirected to the appropriate shard container. The root DB continues to handle listing requests and update requests for any object name beyond "igloo".

image

The sharder daemon cleaves objects from the retiring DB to the shard range DBs; it also moves any misplaced objects from the root container's fresh DB to the shard DB. Cleaving progress is represented by the blue line. Once the first shard range has been cleaved listing requests for that namespace are directed to the shard container. The root container still provides listings for the remainder of the namespace.

image

The process continues: the sharder cleaves the next range and a new range is found with upper bound of "linux". Now the root container only needs to handle listing requests up to "giraffe" and update requests for objects whose name is greater than "linux". Load will continue to diminish on the root DB and be dispersed across the shard DBs.

image

Container replication

Shard range records are replicated between container DB replicas in much the same way as object records are for unsharded containers. However, the usual replication of object records between replicas of a container is halted as soon as a container is capable of being sharded. Instead, object records are moved to their new locations in shard containers. This avoids unnecessary replication traffic between container replicas.

To facilitate this, shard ranges are both 'pushed' and 'pulled' during replication, prior to any attempt to replicate objects. This means that the node initiating replication learns about shard ranges from the destination node early during the replication process and is able to skip object replication if it discovers that it has shard ranges and is able to shard.

Note

When the destination DB for container replication is missing then the 'complete_rsync' replication mechanism is still used and in this case only both object records and shard range records are copied to the destination node.

Container deletion

Sharded containers may be deleted by a DELETE request just like an unsharded container. A sharded container must be empty before it can be deleted which implies that all of its shard containers must have reported that they are empty.

Shard containers are not immediately deleted when their root container is deleted; the shard containers remain undeleted so that they are able to continue to receive object updates that might arrive after the root container has been deleted. Shard containers continue to update their deleted root container with their object stats. If a shard container does receive object updates that cause it to no longer be empty then the root container will no longer be considered deleted once that shard container sends an object stats update.

Sharding a shard container

A shard container may grow to a size that requires it to be sharded. swift-manage-shard-ranges may be used to identify shard ranges within a shard container and enable sharding in the same way as for a root container. When a shard is sharding it notifies the root container of its shard ranges so that the root container can start to redirect object updates to the new 'sub-shards'. When the shard has completed sharding the root is aware of all the new sub-shards and the sharding shard deletes its shard range record in the root container shard ranges table. At this point the root container is aware of all the new sub-shards which collectively cover the namespace of the now-deleted shard.

There is no hierarchy of shards beyond the root container and its immediate shards. When a shard shards, its sub-shards are effectively re-parented with the root container.

Shrinking a shard container

A shard container's contents may reduce to a point where the shard container is no longer required. If this happens then the shard container may be shrunk into another shard range. Shrinking is achieved in a similar way to sharding: an 'acceptor' shard range is written to the shrinking shard container's shard ranges table; unlike sharding, where shard ranges each cover a subset of the sharding container's namespace, the acceptor shard range is a superset of the shrinking shard range.

Once given an acceptor shard range the shrinking shard will cleave itself to its acceptor, and then delete itself from the root container shard ranges table.