Darrell Bishop af2ff124eb Update docs for new ring serialization.

The Admin Guide now contains information about the ring serialization
change (and importantly, how to downgrade, if necessary).

Also added container-server conf var, "allow_versions" to the Deployment
Guide.

Also changed description of proxy-server conf var,
"max_containers_whitelist" to say it contains "account names" not
"account hashes".

Change-Id: Ib23c6118cc5195cc04765afd28e442e4c735f0d4

2012-08-21 12:09:28 -07:00

39 KiB

Raw Blame History

Deployment Guide

Hardware Considerations

Swift is designed to run on commodity hardware. At Rackspace, our storage servers are currently running fairly generic 4U servers with 24 2T SATA drives and 8 cores of processing power. RAID on the storage drives is not required and not recommended. Swift's disk usage pattern is the worst case possible for RAID, and performance degrades very quickly using RAID 5 or 6.

Deployment Options

The swift services run completely autonomously, which provides for a lot of flexibility when architecting the hardware deployment for swift. The 4 main services are:

Proxy Services
Object Services
Container Services
Account Services

The Proxy Services are more CPU and network I/O intensive. If you are using 10g networking to the proxy, or are terminating SSL traffic at the proxy, greater CPU power will be required.

The Object, Container, and Account Services (Storage Services) are more disk and network I/O intensive.

The easiest deployment is to install all services on each server. There is nothing wrong with doing this, as it scales each service out horizontally.

At Rackspace, we put the Proxy Services on their own servers and all of the Storage Services on the same server. This allows us to send 10g networking to the proxy and 1g to the storage servers, and keep load balancing to the proxies more manageable. Storage Services scale out horizontally as storage servers are added, and we can scale overall API throughput by adding more Proxies.

If you need more throughput to either Account or Container Services, they may each be deployed to their own servers. For example you might use faster (but more expensive) SAS or even SSD drives to get faster disk I/O to the databases.

Load balancing and network design is left as an exercise to the reader, but this is a very important part of the cluster, so time should be spent designing the network for a Swift cluster.

Preparing the Ring

The first step is to determine the number of partitions that will be in the ring. We recommend that there be a minimum of 100 partitions per drive to insure even distribution across the drives. A good starting point might be to figure out the maximum number of drives the cluster will contain, and then multiply by 100, and then round up to the nearest power of two.

For example, imagine we are building a cluster that will have no more than 5,000 drives. That would mean that we would have a total number of 500,000 partitions, which is pretty close to 2^19, rounded up.

It is also a good idea to keep the number of partitions small (relatively). The more partitions there are, the more work that has to be done by the replicators and other backend jobs and the more memory the rings consume in process. The goal is to find a good balance between small rings and maximum cluster size.

The next step is to determine the number of replicas to store of the data. Currently it is recommended to use 3 (as this is the only value that has been tested). The higher the number, the more storage that is used but the less likely you are to lose data.

It is also important to determine how many zones the cluster should have. It is recommended to start with a minimum of 5 zones. You can start with fewer, but our testing has shown that having at least five zones is optimal when failures occur. We also recommend trying to configure the zones at as high a level as possible to create as much isolation as possible. Some example things to take into consideration can include physical location, power availability, and network connectivity. For example, in a small cluster you might decide to split the zones up by cabinet, with each cabinet having its own power and network connectivity. The zone concept is very abstract, so feel free to use it in whatever way best isolates your data from failure. Zones are referenced by number, beginning with 1.

You can now start building the ring with:

swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

This will start the ring build process creating the <builder_file> with 2^<part_power> partitions. <min_part_hours> is the time in hours before a specific partition can be moved in succession (24 is a good value for this).

Devices can be added to the ring with:

swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

This will add a device to the ring where <builder_file> is the name of the builder file that was created previously, <zone> is the number of the zone this device is in, <ip> is the ip address of the server the device is in, <port> is the port number that the server is running on, <device_name> is the name of the device on the server (for example: sdb1), <meta> is a string of metadata for the device (optional), and <weight> is a float weight that determines how many partitions are put on the device relative to the rest of the devices in the cluster (a good starting point is 100.0 x TB on the drive). Add each device that will be initially in the cluster.

Once all of the devices are added to the ring, run:

swift-ring-builder <builder_file> rebalance

This will distribute the partitions across the drives in the ring. It is important whenever making changes to the ring to make all the changes required before running rebalance. This will ensure that the ring stays as balanced as possible, and as few partitions are moved as possible.

The above process should be done to make a ring for each storage service (Account, Container and Object). The builder files will be needed in future changes to the ring, so it is very important that these be kept and backed up. The resulting .tar.gz ring file should be pushed to all of the servers in the cluster. For more information about building rings, running swift-ring-builder with no options will display help text with available commands and options. More information on how the ring works internally can be found in the Ring Overview <overview_ring>.

General Server Configuration

Swift uses paste.deploy (http://pythonpaste.org/deploy/) to manage server configurations. Default configuration options are set in the [DEFAULT] section, and any options specified there can be overridden in any of the other sections BUT ONLY BY USING THE SYNTAX set option_name = value. This is the unfortunate way paste.deploy works and I'll try to explain it in full.

First, here's an example paste.deploy configuration file:

[DEFAULT]
name1 = globalvalue
name2 = globalvalue
name3 = globalvalue
set name4 = globalvalue

[pipeline:main]
pipeline = myapp

[app:myapp]
use = egg:mypkg#myapp
name2 = localvalue
set name3 = localvalue
set name5 = localvalue
name6 = localvalue

The resulting configuration that myapp receives is:

global {'__file__': '/etc/mypkg/wsgi.conf', 'here': '/etc/mypkg',
        'name1': 'globalvalue',
        'name2': 'globalvalue',
        'name3': 'localvalue',
        'name4': 'globalvalue',
        'name5': 'localvalue',
        'set name4': 'globalvalue'}
local {'name6': 'localvalue'}

So, name1 got the global value which is fine since it's only in the DEFAULT section anyway.

name2 got the global value from DEFAULT even though it appears to be overridden in the app:myapp subsection. This is just the unfortunate way paste.deploy works (at least at the time of this writing.)

name3 got the local value from the app:myapp subsection because it is using the special paste.deploy syntax of set option_name = value. So, if you want a default value for most app/filters but want to overridde it in one subsection, this is how you do it.

name4 got the global value from DEFAULT since it's only in that section anyway. But, since we used the set syntax in the DEFAULT section even though we shouldn't, notice we also got a set name4 variable. Weird, but probably not harmful.

name5 got the local value from the app:myapp subsection since it's only there anyway, but notice that it is in the global configuration and not the local configuration. This is because we used the set syntax to set the value. Again, weird, but not harmful since Swift just treats the two sets of configuration values as one set anyway.

name6 got the local value from app:myapp subsection since it's only there, and since we didn't use the set syntax, it's only in the local configuration and not the global one. Though, as indicated above, there is no special distinction with Swift.

That's quite an explanation for something that should be so much simpler, but it might be important to know how paste.deploy interprets configuration files. The main rule to remember when working with Swift configuration files is:

Note

Use the set option_name = value syntax in subsections if the option is also set in the [DEFAULT] section. Don't get in the habit of always using the set syntax or you'll probably mess up your non-paste.deploy configuration files.

Object Server Configuration

An Example Object Server configuration can be found at etc/object-server.conf-sample in the source code repository.

The following configuration options are available:

[DEFAULT]

Option	Default	Description
swift_dir	/etc/swift	Swift configuration directory
devices	/srv/node	Parent directory of where devices are mounted
mount_check	true	Whether or not check if the devices are mounted to prevent accidentally writing to the root device
bind_ip	0.0.0.0	IP Address for server to bind to
bind_port	6000	Port for server to bind to
workers	1	Number of workers to fork

[object-server]

Option	Default	Description
use		paste.deploy entry point for the object server. For most cases, this should be egg:swift#object.
set log_name	object-server	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Logging level
set log_requests	True	Whether or not to log each request
user	swift	User to run as
node_timeout	3	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
network_chunk_size	65536	Size of chunks to read/write over the network
disk_chunk_size	65536	Size of chunks to read/write to disk
max_upload_time	86400	Maximum time allowed to upload an object
slow	0	If > 0, Minimum time in seconds for a PUT or DELETE request to complete
mb_per_sync	512	On PUT requests, sync file every n MB
keep_cache_size	5242880	Largest object size to keep in buffer cache
keep_cache_private	false	Allow non-public objects to stay in kernel's buffer cache

[object-replicator]

Option	Default	Description
log_name	object-replicator	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
daemonize	yes	Whether or not to run replication as a daemon
run_pause	30	Time in seconds to wait between replication passes
concurrency	1	Number of replication workers to spawn
timeout	5	Timeout value sent to rsync --timeout and --contimeout options
stats_interval	3600	Interval in seconds between logging replication statistics
reclaim_age	604800	Time elapsed in seconds before an object can be reclaimed

[object-updater]

Option	Default	Description
log_name	object-updater	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
interval	300	Minimum time for a pass to take
concurrency	1	Number of updater workers to spawn
node_timeout	10	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
slowdown	0.01	Time in seconds to wait between objects

[object-auditor]

Option	Default	Description
log_name	object-auditor	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
log_time	3600	Frequency of status logs in seconds.
files_per_second	20	Maximum files audited per second. Should be tuned according to individual system specs. 0 is unlimited.
bytes_per_second	10000000	Maximum bytes audited per second. Should be tuned according to individual system specs. 0 is unlimited.

Container Server Configuration

An example Container Server configuration can be found at etc/container-server.conf-sample in the source code repository.

The following configuration options are available:

[DEFAULT]

Option	Default	Description
swift_dir	/etc/swift	Swift configuration directory
devices	/srv/node	Parent directory of where devices are mounted
mount_check	true	Whether or not check if the devices are mounted to prevent accidentally writing to the root device
bind_ip	0.0.0.0	IP Address for server to bind to
bind_port	6001	Port for server to bind to
workers	1	Number of workers to fork
user	swift	User to run as

[container-server]

Option	Default	Description
use		paste.deploy entry point for the container server. For most cases, this should be egg:swift#container.
set log_name	container-server	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Logging level
node_timeout	3	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
allow_versions	false	Enable/Disable object versioning feature

[container-replicator]

Option	Default	Description
log_name	container-replicator	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level per_diff	INFO 1000	Logging level
concurrency	8	Number of replication workers to spawn
run_pause	30	Time in seconds to wait between replication passes
node_timeout	10	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
reclaim_age	604800	Time elapsed in seconds before a container can be reclaimed

[container-updater]

Option	Default	Description
log_name	container-updater	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
interval	300	Minimum time for a pass to take
concurrency	4	Number of updater workers to spawn
node_timeout	3	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
slowdown	0.01	Time in seconds to wait between containers
account_suppression_time	60	Seconds to suppress updating an account that has generated an error (timeout, not yet found, etc.)

[container-auditor]

Option	Default	Description
log_name	container-auditor	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
interval	1800	Minimum time for a pass to take

Account Server Configuration

An example Account Server configuration can be found at etc/account-server.conf-sample in the source code repository.

The following configuration options are available:

[DEFAULT]

Option	Default	Description
swift_dir	/etc/swift	Swift configuration directory
devices	/srv/node	Parent directory or where devices are mounted
mount_check	true	Whether or not check if the devices are mounted to prevent accidentally writing to the root device
bind_ip	0.0.0.0	IP Address for server to bind to
bind_port	6002	Port for server to bind to
workers	1	Number of workers to fork
user	swift	User to run as
db_preallocation	off	If you don't mind the extra disk space usage in overhead, you can turn this on to preallocate disk space with SQLite databases to decrease fragmentation.

[account-server]

Option	Default	Description
use		Entry point for paste.deploy for the account server. For most cases, this should be egg:swift#account.
set log_name	account-server	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Logging level

[account-replicator]

Option	Default	Description
log_name	account-replicator	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level per_diff	INFO 1000	Logging level
concurrency	8	Number of replication workers to spawn
run_pause	30	Time in seconds to wait between replication passes
node_timeout	10	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
reclaim_age	604800	Time elapsed in seconds before an account can be reclaimed

[account-auditor]

Option	Default	Description
log_name	account-auditor	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
interval	1800	Minimum time for a pass to take

[account-reaper]

Option	Default	Description
log_name	account-auditor	Label used when logging
log_facility	LOG_LOCAL0	Syslog log facility
log_level	INFO	Logging level
concurrency	25	Number of replication workers to spawn
interval	3600	Minimum time for a pass to take
node_timeout	10	Request timeout to external services
conn_timeout	0.5	Connection timeout to external services
delay_reaping	0	Normally, the reaper begins deleting account information for deleted accounts immediately; you can set this to delay its work however. The value is in seconds, 2592000 = 30 days, for example.

Proxy Server Configuration

An example Proxy Server configuration can be found at etc/proxy-server.conf-sample in the source code repository.

The following configuration options are available:

[DEFAULT]

Option	Default	Description
bind_ip	0.0.0.0	IP Address for server to bind to
bind_port	80	Port for server to bind to
swift_dir	/etc/swift	Swift configuration directory
workers	1	Number of workers to fork
user cert_file key_file	swift	User to run as Path to the ssl .crt. This should be enabled for testing purposes only. Path to the ssl .key. This should be enabled for testing purposes only.

[proxy-server]

Option	Default	Description
use		Entry point for paste.deploy for the proxy server. For most cases, this should be egg:swift#proxy.
set log_name	proxy-server	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Log level
set log_headers	True	If True, log headers in each request
set log_handoffs	True	If True, the proxy will log whenever it has to failover to a handoff node
recheck_account_existence	60	Cache timeout in seconds to send memcached for account existence
recheck_container_existence	60	Cache timeout in seconds to send memcached for container existence
object_chunk_size	65536	Chunk size to read from object servers
client_chunk_size	65536	Chunk size to read from clients
memcache_servers	127.0.0.1:11211	Comma separated list of memcached servers ip:port
node_timeout	10	Request timeout to external services
client_timeout	60	Timeout to read one chunk from a client
conn_timeout	0.5	Connection timeout to external services
error_suppression_interval	60	Time in seconds that must elapse since the last error for a node to be considered no longer error limited
error_suppression_limit	10	Error count to consider a node error limited
allow_account_management	false	Whether account PUTs and DELETEs are even callable
object_post_as_copy	true	Set object_post_as_copy = false to turn on fast posts where only the metadata changes are stored anew and the original data file is kept in place. This makes for quicker posts; but since the container metadata isn't updated in this mode, features like container sync won't be able to sync posts.
account_autocreate	false	If set to 'true' authorized accounts that do not yet exist within the Swift cluster will be automatically created.
max_containers_per_account max_containers_whitelist	0	If set to a positive value, trying to create a container when the account already has at least this maximum containers will result in a 403 Forbidden. Note: This is a soft limit, meaning a user might exceed the cap for recheck_account_existence before the 403s kick in. This is a comma separated list of account names that ignore the max_containers_per_account cap.
rate_limit_after_segment	10	Rate limit the download of large object segments after this segment is downloaded.
rate_limit_segments_per_sec	1	Rate limit large object downloads at this rate.

[tempauth]

Option	Default	Description
use		Entry point for paste.deploy to use for auth. To use tempauth set to: egg:swift#tempauth
set log_name	tempauth	Label used when logging
set log_facility	LOG_LOCAL0	Syslog log facility
set log_level	INFO	Log level
set log_headers	True	If True, log headers in each request
reseller_prefix	AUTH	The naming scope for the auth service. Swift storage accounts and auth tokens will begin with this prefix.
auth_prefix	/auth/	The HTTP request path prefix for the auth service. Swift itself reserves anything beginning with the letter v.
token_life	86400	The number of seconds a token is valid.

Additionally, you need to list all the accounts/users you want here. The format is:

user_<account>_<user> = <key> [group] [group] [...] [storage_url]

There are special groups of:

.reseller_admin = can do anything to any account for this auth
.admin = can do anything within the account

If neither of these groups are specified, the user can only access containers that have been explicitly allowed for them by a .admin or .reseller_admin.

The trailing optional storage_url allows you to specify an alternate url to hand back to the user upon authentication. If not specified, this defaults to:

http[s]://<ip>:<port>/v1/<reseller_prefix>_<account>

Where http or https depends on whether cert_file is specified in the [DEFAULT] section, <ip> and <port> are based on the [DEFAULT] section's bind_ip and bind_port (falling back to 127.0.0.1 and 8080), <reseller_prefix> is from this section, and <account> is from the user<account>_<user> name.

Here are example entries, required for running the tests:

user_admin_admin = admin .admin .reseller_admin
user_test_tester = testing .admin
user_test2_tester2 = testing2 .admin
user_test_tester3 = testing3

Memcached Considerations

Several of the Services rely on Memcached for caching certain types of lookups, such as auth tokens, and container/account existence. Swift does not do any caching of actual object data. Memcached should be able to run on any servers that have available RAM and CPU. At Rackspace, we run Memcached on the proxy servers. The memcache_servers config option in the proxy-server.conf should contain all memcached servers.

System Time

Time may be relative but it is relatively important for Swift! Swift uses timestamps to determine which is the most recent version of an object. It is very important for the system time on each server in the cluster to by synced as closely as possible (more so for the proxy server, but in general it is a good idea for all the servers). At Rackspace, we use NTP with a local NTP server to ensure that the system times are as close as possible. This should also be monitored to ensure that the times do not vary too much.

General Service Tuning

Most services support either a worker or concurrency value in the settings. This allows the services to make effective use of the cores available. A good starting point to set the concurrency level for the proxy and storage services to 2 times the number of cores available. If more than one service is sharing a server, then some experimentation may be needed to find the best balance.

At Rackspace, our Proxy servers have dual quad core processors, giving us 8 cores. Our testing has shown 16 workers to be a pretty good balance when saturating a 10g network and gives good CPU utilization.

Our Storage servers all run together on the same servers. These servers have dual quad core processors, for 8 cores total. We run the Account, Container, and Object servers with 8 workers each. Most of the background jobs are run at a concurrency of 1, with the exception of the replicators which are run at a concurrency of 2.

The above configuration setting should be taken as suggestions and testing of configuration settings should be done to ensure best utilization of CPU, network connectivity, and disk I/O.

Filesystem Considerations

Swift is designed to be mostly filesystem agnostic--the only requirement being that the filesystem supports extended attributes (xattrs). After thorough testing with our use cases and hardware configurations, XFS was the best all-around choice. If you decide to use a filesystem other than XFS, we highly recommend thorough testing.

If you are using XFS, some settings that can dramatically impact performance. We recommend the following when creating the XFS partition:

mkfs.xfs -i size=1024 -f /dev/sda1

Setting the inode size is important, as XFS stores xattr data in the inode. If the metadata is too large to fit in the inode, a new extent is created, which can cause quite a performance problem. Upping the inode size to 1024 bytes provides enough room to write the default metadata, plus a little headroom. We do not recommend running Swift on RAID, but if you are using RAID it is also important to make sure that the proper sunit and swidth settings get set so that XFS can make most efficient use of the RAID array.

We also recommend the following example mount options when using XFS:

mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /srv/node/sda

For a standard swift install, all data drives are mounted directly under /srv/node (as can be seen in the above example of mounting /def/sda1 as /srv/node/sda). If you choose to mount the drives in another directory, be sure to set the devices config option in all of the server configs to point to the correct directory.

General System Tuning

Rackspace currently runs Swift on Ubuntu Server 10.04, and the following changes have been found to be useful for our use cases.

The following settings should be in `/etc/sysctl.conf`:

# disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1

# disable syn cookies
net.ipv4.tcp_syncookies = 0

# double amount of allowed conntrack
net.ipv4.netfilter.ip_conntrack_max = 262144

To load the updated sysctl settings, run sudo sysctl -p

A note about changing the TIME_WAIT values. By default the OS will hold a port open for 60 seconds to ensure that any remaining packets can be received. During high usage, and with the number of connections that are created, it is easy to run out of ports. We can change this since we are in control of the network. If you are not in control of the network, or do not expect high loads, then you may not want to adjust those values.

Logging Considerations

Swift is set up to log directly to syslog. Every service can be configured with the log_facility option to set the syslog log facility destination. We recommended using syslog-ng to route the logs to specific log files locally on the server and also to remote log collecting servers.

39 KiB Raw Blame History

Deployment Guide

Hardware Considerations

Deployment Options

Preparing the Ring

General Server Configuration

Object Server Configuration

Container Server Configuration

Account Server Configuration

Proxy Server Configuration

Memcached Considerations

System Time

General Service Tuning

Filesystem Considerations

General System Tuning

Logging Considerations

39 KiB

Raw Blame History