Alistair Coles 2a593174a5 sharder: avoid small tail shards

A container is typically sharded when it has grown to have an object
count of shard_container_threshold + N, where N <<
shard_container_threshold.  If sharded using the default
rows_per_shard of shard_container_threshold / 2 then this would
previously result in 3 shards: the tail shard would typically be
small, having only N rows. This behaviour caused more shards to be
generated than desirable.

This patch adds a minimum-shard-size option to
swift-manage-shard-ranges, and a corresponding option in the sharder
config, which can be used to avoid small tail shards. If set to
greater than one then the final shard range may be extended to more
than rows_per_shard in order to avoid a further shard range with less
than minimum-shard-size rows. In the example given, if
minimum-shard-size is set to M > N then the container would shard into
two shards having rows_per_shard rows and rows_per_shard + N
respectively.

The default value for minimum-shard-size is rows_per_shard // 5. If
all options have their default values this results in
minimum-shard-size being 100000.
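
The effect described above can be illustrated with a small model. This is a minimal sketch, not the sharder's actual code: the function name `shard_sizes` is hypothetical, and it works on plain row counts rather than scanning a container database.

```python
def shard_sizes(object_count, rows_per_shard, minimum_shard_size=1):
    """Return the row count of each shard range (simplified model).

    Hypothetical helper for illustration only; the real sharder finds
    shard ranges by scanning object rows in the container database.
    """
    sizes = []
    remaining = object_count
    while remaining > 0:
        leftover = remaining - rows_per_shard
        if leftover < minimum_shard_size:
            # A further shard would hold fewer than minimum_shard_size
            # rows, so the final shard is extended to absorb the tail.
            sizes.append(remaining)
            remaining = 0
        else:
            sizes.append(rows_per_shard)
            remaining = leftover
    return sizes


# shard_container_threshold = 1000000, N = 50, rows_per_shard = 500000
before = shard_sizes(1000050, 500000, minimum_shard_size=1)
after = shard_sizes(1000050, 500000, minimum_shard_size=100000)
print(before)
print(after)
```

With a minimum of one row the model yields three shards, the last holding only N rows; with a minimum larger than N it yields two shards, the second holding rows_per_shard + N rows.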

Closes-Bug: #1928370
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I3baa278c6eaf488e3f390a936eebbec13f2c3e55

2021-07-07 13:59:36 +01:00

Container Server Configuration

This document describes the configuration options available for the container server. Documentation for other swift configuration options can be found at index.

An example Container Server configuration can be found at etc/container-server.conf-sample in the source code repository.

The following configuration sections are available:

[DEFAULT]
[container-server]
[container-replicator]
[container-sharder]
[container-updater]
[container-auditor]

[DEFAULT]

Option	Default	Description
swift_dir	/etc/swift	Swift configuration directory
devices	/srv/node	Parent directory of where devices are mounted
mount_check	true	Whether or not to check that the devices are mounted, to prevent accidentally writing to the root device
bind_ip	0.0.0.0	IP Address for server to bind to
bind_port	6201	Port for server to bind to
keep_idle	600	Value to set for socket TCP_KEEPIDLE
bind_timeout	30	Seconds to attempt bind before giving up
backlog	4096	Maximum number of allowed pending connections
workers	auto	Override the number of pre-forked workers that will accept connections. If set it should be an integer; zero means no fork. If unset, it will try to default to the number of effective CPU cores and fall back to one. Increasing the number of workers may reduce the possibility of slow file system operations in one request from negatively impacting other requests. See `general-service-tuning`.
max_clients	1024	Maximum number of clients one worker can process simultaneously (it will actually accept(2) N + 1). Setting this to one (1) will only handle one request at a time, without accepting another request concurrently.
user	swift	User to run as
disable_fallocate	false	Disable "fast fail" fallocate checks if the underlying filesystem does not support it.
log_name	swift	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_address	/dev/log	Logging directory
log_max_line_length	0	Caps the length of log lines to the value given; no limit if set to 0, the default.
log_custom_handlers	None	Comma-separated list of functions to call to set up custom log handlers.
log_udp_host		Override log_address
log_udp_port	514	UDP log port
log_statsd_host	None	Enables StatsD logging; IPv4/IPv6 address or a hostname. If a hostname resolves to an IPv4 and IPv6 address, the IPv4 address will be used.
log_statsd_port	8125	
log_statsd_default_sample_rate	1.0	
log_statsd_sample_rate_factor	1.0	
log_statsd_metric_prefix		
eventlet_debug	false	If true, turn on debug logging for eventlet
fallocate_reserve	1%	You can set fallocate_reserve to the number of bytes or percentage of disk space you'd like fallocate to reserve, whether there is space for the given file size or not. Percentage will be used if the value ends with a '%'. This is useful for systems that behave badly when they completely run out of space; you can make the services pretend they're out of space early.
db_preallocation	off	If you don't mind the extra disk space usage in overhead, you can turn this on to preallocate disk space with SQLite databases to decrease fragmentation.
nice_priority	None	Scheduling priority of server processes. Niceness values range from -20 (most favorable to the process) to 19 (least favorable to the process). The default does not modify priority.
ionice_class	None	I/O scheduling class of server processes. I/O niceness class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort), and IOPRIO_CLASS_IDLE (idle). The default does not modify class and priority. Linux has supported I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler. Works only with ionice_priority.
ionice_priority	None	I/O scheduling priority of server processes. I/O niceness priority is a number which goes from 0 to 7. The higher the value, the lower the I/O priority of the process. Works only with ionice_class. Ignored if IOPRIO_CLASS_IDLE is set.
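
The two accepted forms of fallocate_reserve (a byte count, or a percentage when the value ends with '%') can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the sketch assumes the percentage is taken of the total disk size, which may differ from how the services apply it.

```python
def parse_fallocate_reserve(value):
    """Split a fallocate_reserve setting into (number, is_percent).

    Hypothetical helper; a value ending in '%' is a percentage,
    anything else is treated as a number of bytes.
    """
    text = str(value).strip()
    is_percent = text.endswith('%')
    number = float(text[:-1] if is_percent else text)
    if number < 0:
        raise ValueError('fallocate_reserve must not be negative')
    return number, is_percent


def reserved_bytes(value, disk_size_bytes):
    """Return the number of bytes to keep free on a disk of the given size.

    Assumption for this sketch: a percentage refers to total disk size.
    """
    number, is_percent = parse_fallocate_reserve(value)
    if is_percent:
        return int(disk_size_bytes * number / 100)
    return int(number)


print(reserved_bytes('1%', 1000000000))
print(reserved_bytes('4096', 1000000000))
```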

[container-server]

Option	Default	Description
use		paste.deploy entry point for the container server. For most cases, this should be `egg:swift#container`.
set log_name	container-server	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Logging level
set log_requests	True	Whether or not to log each request
set log_address	/dev/log	Logging directory
node_timeout	3	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
allow_versions	false	Enable/Disable object versioning feature
replication_server		Configure parameter for creating specific server. To handle all verbs, including replication verbs, do not specify "replication_server" (this is the default). To only handle replication, set to a True value (e.g. "True" or "1"). To handle only non-replication verbs, set to "False". Unless you have a separate replication network, you should not specify any value for "replication_server".
nice_priority	None	Scheduling priority of server processes. Niceness values range from -20 (most favorable to the process) to 19 (least favorable to the process). The default does not modify priority.
ionice_class	None	I/O scheduling class of server processes. I/O niceness class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort), and IOPRIO_CLASS_IDLE (idle). The default does not modify class and priority. Linux has supported I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler. Works only with ionice_priority.
ionice_priority	None	I/O scheduling priority of server processes. I/O niceness priority is a number which goes from 0 to 7. The higher the value, the lower the I/O priority of the process. Works only with ionice_class. Ignored if IOPRIO_CLASS_IDLE is set.
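
The three states of replication_server described in the table above (unset, true, false) can be summarised in a short sketch. The function name, the set of truthy strings, and the choice of REPLICATE as the only replication verb are assumptions made for illustration; this is not the server's own request-handling code.

```python
# Assumed set of strings treated as true; the table only names "True" and "1".
TRUE_STRINGS = {'true', '1'}


def handles_method(method, replication_server=None):
    """Return True if a server with this setting would handle the verb.

    Hypothetical helper. replication_server is the raw config string,
    or None when the option is not specified.
    """
    is_replication_verb = method.upper() == 'REPLICATE'
    if replication_server is None:
        # Option unset (the default): handle all verbs.
        return True
    replication_only = replication_server.strip().lower() in TRUE_STRINGS
    if replication_only:
        return is_replication_verb
    return not is_replication_verb


for setting in (None, 'True', 'False'):
    print(setting, handles_method('GET', setting),
          handles_method('REPLICATE', setting))
```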

[container-replicator]

Option	Default	Description
log_name	container-replicator	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_address	/dev/log	Logging directory
per_diff	1000	Maximum number of database rows that will be sync'd in a single HTTP replication request. Databases with less than or equal to this number of differing rows will always be sync'd using an HTTP replication request rather than using rsync.
max_diffs	100	Maximum number of HTTP replication requests attempted on each replication pass for any one container. This caps how long the replicator will spend trying to sync a given database per pass so the other databases don't get starved.
concurrency	8	Number of replication workers to spawn
interval	30	Time in seconds to wait between replication passes
databases_per_second	50	Maximum databases to process per second. Should be tuned according to individual system specs. 0 is unlimited.
node_timeout	10	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
reclaim_age	604800	Time elapsed in seconds before a container can be reclaimed
rsync_module	{replication_ip}::container	Format of the rsync module where the replicator will send data. The configuration value can include some variables that will be extracted from the ring. Variables must follow the format {NAME} where NAME is one of: ip, port, replication_ip, replication_port, region, zone, device, meta. See etc/rsyncd.conf-sample for some examples.
rsync_compress	no	Allow rsync to compress data which is transmitted to destination node during sync. However, this is applicable only when destination node is in a different region than the local one. NOTE: Objects that are already compressed (for example: .tar.gz, mp3) might slow down the syncing process.
recon_cache_path	/var/cache/swift	Path to recon cache
nice_priority	None	Scheduling priority of server processes. Niceness values range from -20 (most favorable to the process) to 19 (least favorable to the process). The default does not modify priority.
ionice_class	None	I/O scheduling class of server processes. I/O niceness class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort), and IOPRIO_CLASS_IDLE (idle). The default does not modify class and priority. Linux has supported I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler. Works only with ionice_priority.
ionice_priority	None	I/O scheduling priority of server processes. I/O niceness priority is a number which goes from 0 to 7. The higher the value, the lower the I/O priority of the process. Works only with ionice_class. Ignored if IOPRIO_CLASS_IDLE is set.
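
As a rough illustration of how the {NAME} variables in rsync_module are filled in, the sketch below expands a template from a device entry. The function name and the sample device values are hypothetical, and the second template is only an example of a per-device module name; the replicator's own substitution code may differ.

```python
# Variable names permitted in rsync_module, as listed in the table above.
ALLOWED_NAMES = ('ip', 'port', 'replication_ip', 'replication_port',
                 'region', 'zone', 'device', 'meta')


def expand_rsync_module(template, device):
    """Fill {NAME} placeholders from a ring device entry (illustrative)."""
    values = {name: device.get(name, '') for name in ALLOWED_NAMES}
    # str.format raises KeyError for any placeholder not in ALLOWED_NAMES.
    return template.format(**values)


# Made-up device entry for the example.
sample_device = {
    'ip': '192.0.2.10', 'port': 6201,
    'replication_ip': '10.0.0.5', 'replication_port': 6201,
    'region': 1, 'zone': 2, 'device': 'sdb1', 'meta': '',
}

print(expand_rsync_module('{replication_ip}::container', sample_device))
print(expand_rsync_module('{replication_ip}::container_{device}',
                          sample_device))
```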

[container-sharder]

The container-sharder re-uses features of the container-replicator and inherits the following configuration options defined for the [container-replicator]:

interval
databases_per_second
per_diff
max_diffs
concurrency
node_timeout
conn_timeout
reclaim_age
rsync_compress
rsync_module
recon_cache_path

Some config options in this section may also be used by the swift-manage-shard-ranges CLI tool <swift-manage-shard-ranges>.

Option	Default	Description
log_name	container-sharder	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_address	/dev/log	Logging directory
auto_shard	false	If the auto_shard option is true then the sharder will automatically select containers to shard, scan for shard ranges, and select shards to shrink. Warning: auto-sharding is still under development and should not be used in production; do not set this option to true in a production cluster.
shard_container_threshold	1000000	This defines the object count at which a container with container-sharding enabled will start to shard. This also indirectly determines the defaults for rows_per_shard, shrink_threshold and expansion_limit.
rows_per_shard	500000	This defines the initial nominal size of shard containers. The default is shard_container_threshold // 2.
minimum_shard_size	100000	Minimum size of the final shard range. If this is greater than one then the final shard range may be extended to more than rows_per_shard in order to avoid a further shard range with less than minimum_shard_size rows. The default value is rows_per_shard // 5.
shrink_threshold		This defines the object count below which a 'donor' shard container will be considered for shrinking into another 'acceptor' shard container. The default is determined by shard_shrink_point. If set, shrink_threshold will take precedence over shard_shrink_point.
shard_shrink_point	10	Deprecated: shrink_threshold is recommended and if set will take precedence over shard_shrink_point. This defines the object count below which a 'donor' shard container will be considered for shrinking into another 'acceptor' shard container. shard_shrink_point is a percentage of shard_container_threshold e.g. the default value of 10 means 10% of the shard_container_threshold.
expansion_limit		This defines the maximum allowed size of an acceptor shard container after having a donor merged into it. The default is determined by shard_shrink_merge_point. If set, expansion_limit will take precedence over shard_shrink_merge_point.
shard_shrink_merge_point	75	Deprecated: expansion_limit is recommended and if set will take precedence over shard_shrink_merge_point. This defines the maximum allowed size of an acceptor shard container after having a donor merged into it. shard_shrink_merge_point is a percentage of shard_container_threshold, e.g. the default value of 75 means that the projected sum of a donor object count and acceptor object count must be less than 75% of shard_container_threshold for the donor to be allowed to merge into the acceptor. For example, if shard_container_threshold is 1 million, shard_shrink_point is 10, and shard_shrink_merge_point is 75 then a shard will be considered for shrinking if it has less than or equal to 100 thousand objects but will only merge into an acceptor if the combined object count would be less than or equal to 750 thousand objects.
shard_scanner_batch_size	10	When auto-sharding is enabled this defines the maximum number of shard ranges that will be found each time the sharder daemon visits a sharding container. If necessary the sharder daemon will continue to search for more shard ranges each time it visits the container.
cleave_batch_size	2	Defines the number of shard ranges that will be cleaved each time the sharder daemon visits a sharding container.
cleave_row_batch_size	10000	Defines the size of batches of object rows read from a sharding container and merged to a shard container during cleaving.
shard_replication_quorum	auto	Defines the number of successfully replicated shard dbs required when cleaving a previously uncleaved shard range before the sharder will progress to the next shard range. The value should be less than or equal to the container ring replica count. The default of 'auto' causes the container ring quorum value to be used. This option only applies to the container-sharder replication and does not affect the number of shard container replicas that will eventually be replicated by the container-replicator.
existing_shard_replication_quorum	auto	Defines the number of successfully replicated shard dbs required when cleaving a shard range that has been previously cleaved on another node before the sharder will progress to the next shard range. The value should be less than or equal to the container ring replica count. The default of 'auto' causes the shard_replication_quorum value to be used. This option only applies to the container-sharder replication and does not affect the number of shard container replicas that will eventually be replicated by the container-replicator.
internal_client_conf_path	see description	The sharder uses an internal client to create and make requests to containers. The absolute path to the client config file can be configured. Defaults to /etc/swift/internal-client.conf
request_tries	3	The number of times the internal client will retry requests.
recon_candidates_limit	5	Each time the sharder dumps stats to the recon cache file it includes a list of containers that appear to need sharding but are not yet sharding. By default this list is limited to the top 5 containers, ordered by object count. The limit may be changed by setting recon_candidates_limit to an integer value. A negative value implies no limit.
broker_timeout	60	Large databases tend to take a while to work with, but we want to make sure we write down our progress. Use a larger-than-normal broker timeout to make us less likely to bomb out on a LockTimeout.
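
The derived defaults and the worked shrinking example in the table above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names. It follows the "less than or equal to" wording of the worked example and assumes integer percentages; it is not the sharder's own implementation.

```python
def sharder_defaults(shard_container_threshold=1000000,
                     shard_shrink_point=10,
                     shard_shrink_merge_point=75):
    """Derive the size-related defaults described in the table (sketch)."""
    rows_per_shard = shard_container_threshold // 2
    return {
        'rows_per_shard': rows_per_shard,
        'minimum_shard_size': rows_per_shard // 5,
        # The two percentage options are relative to the threshold.
        'shrink_threshold':
            shard_container_threshold * shard_shrink_point // 100,
        'expansion_limit':
            shard_container_threshold * shard_shrink_merge_point // 100,
    }


def can_shrink(donor_count, acceptor_count, shrink_threshold,
               expansion_limit):
    """Return True if the donor may merge into the acceptor (sketch)."""
    small_enough = donor_count <= shrink_threshold
    fits = donor_count + acceptor_count <= expansion_limit
    return small_enough and fits


defaults = sharder_defaults()
print(defaults)
print(can_shrink(100000, 650000,
                 defaults['shrink_threshold'], defaults['expansion_limit']))
```

With all options at their defaults this gives rows_per_shard of 500000, minimum_shard_size of 100000, a shrink threshold of 100000 and an expansion limit of 750000, matching the values quoted in the table.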

[container-updater]

Option	Default	Description
log_name	container-updater	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_address	/dev/log	Logging directory
interval	300	Minimum time for a pass to take
concurrency	4	Number of updater workers to spawn
node_timeout	3	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
containers_per_second	50	Maximum containers updated per second. Should be tuned according to individual system specs. 0 is unlimited.
slowdown	0.01	Time in seconds to wait between containers. Deprecated in favor of containers_per_second.
account_suppression_time	60	Seconds to suppress updating an account that has generated an error (timeout, not yet found, etc.)
recon_cache_path	/var/cache/swift	Path to recon cache
nice_priority	None	Scheduling priority of server processes. Niceness values range from -20 (most favorable to the process) to 19 (least favorable to the process). The default does not modify priority.
ionice_class	None	I/O scheduling class of server processes. I/O niceness class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort), and IOPRIO_CLASS_IDLE (idle). The default does not modify class and priority. Linux has supported I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler. Works only with ionice_priority.
ionice_priority	None	I/O scheduling priority of server processes. I/O niceness priority is a number which goes from 0 to 7. The higher the value, the lower the I/O priority of the process. Works only with ionice_class. Ignored if IOPRIO_CLASS_IDLE is set.

[container-auditor]

Option	Default	Description
log_name	container-auditor	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_address	/dev/log	Logging directory
interval	1800	Minimum time for a pass to take
containers_per_second	200	Maximum containers audited per second. Should be tuned according to individual system specs. 0 is unlimited.
recon_cache_path	/var/cache/swift	Path to recon cache
nice_priority	None	Scheduling priority of server processes. Niceness values range from -20 (most favorable to the process) to 19 (least favorable to the process). The default does not modify priority.
ionice_class	None	I/O scheduling class of server processes. I/O niceness class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort), and IOPRIO_CLASS_IDLE (idle). The default does not modify class and priority. Linux has supported I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler. Works only with ionice_priority.
ionice_priority	None	I/O scheduling priority of server processes. I/O niceness priority is a number which goes from 0 to 7. The higher the value, the lower the I/O priority of the process. Works only with ionice_class. Ignored if IOPRIO_CLASS_IDLE is set.
