Added initial cut of deployment guide and updated auth overview to better reflect the current code, with contributions from John Dickinson, Mike Barton, and Greg Holt.
commit d6349fda14

doc/source/deployment_guide.rst (new file, 450 lines)
@@ -0,0 +1,450 @@

================
Deployment Guide
================

-----------------------
Hardware Considerations
-----------------------

Swift is designed to run on commodity hardware. At Rackspace, our storage
servers are currently running fairly generic 4U servers with 24 2T SATA
drives and 8 cores of processing power. RAID on the storage drives is not
required and not recommended. Swift's disk usage pattern is the worst case
possible for RAID, and performance degrades very quickly using RAID 5 or 6.

------------------
Deployment Options
------------------

The swift services run completely autonomously, which provides for a lot of
flexibility when architecting the hardware deployment for swift. The 4 main
services are:

#. Proxy Services
#. Object Services
#. Container Services
#. Account Services

The Proxy Services are more CPU and network I/O intensive. If you are using
10g networking to the proxy, or are terminating SSL traffic at the proxy,
greater CPU power will be required.

The Object, Container, and Account Services (Storage Services) are more disk
and network I/O intensive.

The easiest deployment is to install all services on each server. There is
nothing wrong with doing this, as it scales each service out horizontally.

At Rackspace, we put the Proxy Services on their own servers and all of the
Storage Services on the same server. This allows us to send 10g networking to
the proxy and 1g to the storage servers, and keep load balancing to the
proxies more manageable. Storage Services scale out horizontally as storage
servers are added, and we can scale overall API throughput by adding more
proxies.

If you need more throughput to either Account or Container Services, they may
each be deployed to their own servers. For example, you might use faster (but
more expensive) SAS or even SSD drives to get faster disk I/O to the databases.

Load balancing and network design is left as an exercise to the reader, but
this is a very important part of the cluster, so time should be spent
designing the network for a Swift cluster.

------------------
Preparing the Ring
------------------

The first step is to determine the number of partitions that will be in the
ring. We recommend that there be a minimum of 100 partitions per drive to
ensure even distribution across the drives. A good starting point might be
to figure out the maximum number of drives the cluster will contain, multiply
by 100, and then round up to the nearest power of two.

For example, imagine we are building a cluster that will have no more than
5,000 drives. That would mean that we would have a total number of 500,000
partitions, which is pretty close to 2^19 (524,288), rounded up.

It is also a good idea to keep the number of partitions relatively small.
The more partitions there are, the more work that has to be done by the
replicators and other backend jobs and the more memory the rings consume in
process. The goal is to find a good balance between small rings and maximum
cluster size.

The next step is to determine the number of replicas to store of the data.
Currently it is recommended to use 3 (as this is the only value that has
been tested). The higher the number, the more storage that is used but the
less likely you are to lose data.

It is also important to determine how many zones the cluster should have. It is
recommended to start with a minimum of 5 zones. You can start with fewer, but
our testing has shown that having at least five zones is optimal when failures
occur. We also recommend trying to configure the zones at as high a level as
possible to create as much isolation as possible. Some example things to take
into consideration can include physical location, power availability, and
network connectivity. For example, in a small cluster you might decide to
split the zones up by cabinet, with each cabinet having its own power and
network connectivity. The zone concept is very abstract, so feel free to use
it in whatever way best isolates your data from failure. Zones are referenced
by number, beginning with 1.

You can now start building the ring with::

    swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

This will start the ring build process, creating the <builder_file> with
2^<part_power> partitions. <min_part_hours> is the time in hours before a
specific partition can be moved in succession (24 is a good value for this).
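
For example, continuing the hypothetical 5,000 drive cluster above (partition
power of 19, 3 replicas, and a 24 hour minimum between partition moves), the
object ring build might start with (the builder file name is just a
convention)::

    swift-ring-builder object.builder create 19 3 24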

Devices can be added to the ring with::

    swift-ring-builder <builder_file> add z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

This will add a device to the ring where <builder_file> is the name of the
builder file that was created previously, <zone> is the number of the zone
this device is in, <ip> is the ip address of the server the device is in,
<port> is the port number that the server is running on, <device_name> is
the name of the device on the server (for example: sdb1), <meta> is a string
of metadata for the device (optional), and <weight> is a float weight that
determines how many partitions are put on the device relative to the rest of
the devices in the cluster (a good starting point is 100.0 x TB on the drive).
Add each device that will be initially in the cluster.
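
As an illustration (the IP address here is hypothetical), adding a 2 TB drive
named sdb1, served on port 6000 by a storage server in zone 1, with the
suggested 100.0 x TB weight, might look like::

    swift-ring-builder object.builder add z1-10.0.0.1:6000/sdb1 200.0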

Once all of the devices are added to the ring, run::

    swift-ring-builder <builder_file> rebalance

This will distribute the partitions across the drives in the ring. It is
important whenever making changes to the ring to make all the changes
required before running rebalance. This will ensure that the ring stays as
balanced as possible, and as few partitions are moved as possible.

The above process should be done to make a ring for each storage service
(Account, Container and Object). The builder files will be needed in future
changes to the ring, so it is very important that these be kept and backed up.
The resulting .ring.gz ring file should be pushed to all of the servers in the
cluster. For more information about building rings, running swift-ring-builder
with no options will display help text with available commands and options.
More information on how the ring works internally can be found in the
:doc:`Ring Overview <overview_ring>`.
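
Putting the pieces together, a minimal sketch of building and distributing all
three rings might look like the following. The hostnames are hypothetical, and
the storage servers here run the account, container, and object servers on
ports 6002, 6001, and 6000 as in the configuration tables below::

    cd /etc/swift
    swift-ring-builder account.builder create 19 3 24
    swift-ring-builder container.builder create 19 3 24
    swift-ring-builder object.builder create 19 3 24
    swift-ring-builder account.builder add z1-10.0.0.1:6002/sdb1 200.0
    swift-ring-builder container.builder add z1-10.0.0.1:6001/sdb1 200.0
    swift-ring-builder object.builder add z1-10.0.0.1:6000/sdb1 200.0
    # ... repeat the add for every device in every zone ...
    swift-ring-builder account.builder rebalance
    swift-ring-builder container.builder rebalance
    swift-ring-builder object.builder rebalance
    # push the resulting ring files to every server in the cluster
    for h in proxy1 storage1 storage2; do scp *.ring.gz $h:/etc/swift/; done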

---------------------------
Object Server Configuration
---------------------------

An example Object Server configuration can be found at
etc/object-server.conf-sample in the source code repository.

The following configuration options are available:

[object-server]

================== ========== =============================================
Option             Default    Description
------------------ ---------- ---------------------------------------------
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing to
                              the root device
bind_ip            0.0.0.0    IP address for server to bind to
bind_port          6000       Port for server to bind to
workers            1          Number of workers to fork
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
log_requests       True       Whether or not to log each request
user               swift      User to run as
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
network_chunk_size 65536      Size of chunks to read/write over the network
disk_chunk_size    65536      Size of chunks to read/write to disk
max_upload_time    86400      Maximum time allowed to upload an object
slow               0          If > 0, minimum time in seconds for a PUT
                              or DELETE request to complete
================== ========== =============================================
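
As an illustrative (not authoritative) fragment, an object-server.conf
overriding a few of the defaults in the table above might look like::

    [object-server]
    # listen on all interfaces of the storage network
    bind_ip = 0.0.0.0
    bind_port = 6000
    # see General Service Tuning below before picking a worker count
    workers = 8
    devices = /srv/node
    mount_check = true
    log_facility = LOG_LOCAL0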

[object-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
daemonize          yes        Whether or not to run replication as a
                              daemon
run_pause          30         Time in seconds to wait between replication
                              passes
concurrency        1          Number of replication workers to spawn
timeout            5          Timeout value sent to rsync --timeout and
                              --contimeout options
stats_interval     3600       Interval in seconds between logging
                              replication statistics
reclaim_age        604800     Time elapsed in seconds before an object
                              can be reclaimed
================== ========== ===========================================

[object-updater]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           300        Minimum time for a pass to take
concurrency        1          Number of updater workers to spawn
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
slowdown           0.01       Time in seconds to wait between objects
================== ========== ===========================================

[object-auditor]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           1800       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================

------------------------------
Container Server Configuration
------------------------------

An example Container Server configuration can be found at
etc/container-server.conf-sample in the source code repository.

The following configuration options are available:

[container-server]

================== ========== ============================================
Option             Default    Description
------------------ ---------- --------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are
                              mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing
                              to the root device
bind_ip            0.0.0.0    IP address for server to bind to
bind_port          6001       Port for server to bind to
workers            1          Number of workers to fork
user               swift      User to run as
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ============================================
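
Analogous to the object server sketch above, a hypothetical
container-server.conf fragment derived from the defaults in this table might
be::

    [container-server]
    bind_port = 6001
    devices = /srv/node
    mount_check = true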

[container-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
per_diff           1000       Maximum number of database rows per diff
concurrency        8          Number of replication workers to spawn
run_pause          30         Time in seconds to wait between replication
                              passes
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
reclaim_age        604800     Time elapsed in seconds before a container
                              can be reclaimed
================== ========== ===========================================

[container-updater]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           300        Minimum time for a pass to take
concurrency        4          Number of updater workers to spawn
node_timeout       3          Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
slowdown           0.01       Time in seconds to wait between containers
================== ========== ===========================================

[container-auditor]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
interval           1800       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================

----------------------------
Account Server Configuration
----------------------------

An example Account Server configuration can be found at
etc/account-server.conf-sample in the source code repository.

The following configuration options are available:

[account-server]

================== ========== =============================================
Option             Default    Description
------------------ ---------- ---------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
swift_dir          /etc/swift Swift configuration directory
devices            /srv/node  Parent directory of where devices are mounted
mount_check        true       Whether or not to check if the devices are
                              mounted to prevent accidentally writing
                              to the root device
bind_ip            0.0.0.0    IP address for server to bind to
bind_port          6002       Port for server to bind to
workers            1          Number of workers to fork
user               swift      User to run as
================== ========== =============================================
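
And likewise for the account server, an illustrative fragment built from the
table defaults::

    [account-server]
    bind_port = 6002
    devices = /srv/node
    mount_check = true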

[account-replicator]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
per_diff           1000       Maximum number of database rows per diff
concurrency        8          Number of replication workers to spawn
run_pause          30         Time in seconds to wait between replication
                              passes
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
reclaim_age        604800     Time elapsed in seconds before an account
                              can be reclaimed
================== ========== ===========================================

[account-auditor]

==================== ========== ===========================================
Option               Default    Description
-------------------- ---------- -------------------------------------------
log_facility         LOG_LOCAL0 Syslog log facility
log_level            INFO       Logging level
interval             1800       Minimum time for a pass to take
max_container_count  100        Maximum containers randomly picked for
                                a given account audit
node_timeout         10         Request timeout to external services
conn_timeout         0.5        Connection timeout to external services
==================== ========== ===========================================

[account-reaper]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Logging level
concurrency        25         Number of reaper workers to spawn
interval           3600       Minimum time for a pass to take
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================

----------------------
General Service Tuning
----------------------

Most services support either a worker or concurrency value in the settings.
This allows the services to make effective use of the cores available. A good
starting point is to set the concurrency level for the proxy and storage
services to 2 times the number of cores available. If more than one service is
sharing a server, then some experimentation may be needed to find the best
balance.
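
For instance, on a hypothetical dedicated proxy server with 8 cores, the 2x
rule of thumb would suggest::

    [proxy-server]
    # 2 x 8 cores; adjust after measuring CPU and network utilization
    workers = 16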

At Rackspace, our Proxy servers have dual quad core processors, giving us 8
cores. Our testing has shown 16 workers to be a pretty good balance when
saturating a 10g network and gives good CPU utilization.

Our Storage services all run together on the same servers. These servers have
dual quad core processors, for 8 cores total. We run the Account, Container,
and Object servers with 8 workers each. Most of the background jobs are run
with a concurrency of 1, with the exception of the replicators, which are run
with a concurrency of 2.

The above configuration settings should be taken as suggestions; testing
of configuration settings should be done to ensure the best utilization of
CPU, network connectivity, and disk I/O.

-------------------------
Filesystem Considerations
-------------------------

Swift is designed to be mostly filesystem agnostic--the only requirement
being that the filesystem supports extended attributes (xattrs). After
thorough testing with our use cases and hardware configurations, XFS was
the best all-around choice. If you decide to use a filesystem other than
XFS, we highly recommend thorough testing.
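
A quick way to check that a mounted filesystem supports the user-namespace
xattrs Swift needs is a round trip with the attr tools (this assumes the
getfattr/setfattr utilities are installed; the mount point is hypothetical)::

    touch /srv/node/sda/xattr_test
    setfattr -n user.test -v hello /srv/node/sda/xattr_test
    getfattr -n user.test /srv/node/sda/xattr_test
    rm /srv/node/sda/xattr_test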

If you are using XFS, there are some settings that can dramatically impact
performance. We recommend the following when creating the XFS partition::

    mkfs.xfs -i size=1024 -f /dev/sda1

Setting the inode size is important, as XFS stores xattr data in the inode.
If the metadata is too large to fit in the inode, a new extent is created,
which can cause quite a performance problem. Upping the inode size to 1024
bytes provides enough room to write the default metadata, plus a little
headroom. We do not recommend running Swift on RAID, but if you are using
RAID it is also important to make sure that the proper sunit and swidth
settings get set so that XFS can make the most efficient use of the RAID
array.
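
As a rough illustration only (the right values depend entirely on your RAID
chunk size and data-drive count, so treat these numbers as placeholders), a
10 data-disk array with a 64k chunk size might be formatted with::

    mkfs.xfs -i size=1024 -d su=64k,sw=10 -f /dev/sda1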

We also recommend the following example mount options when using XFS::

    mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sda1 /srv/node/sda

For a standard swift install, all data drives are mounted directly under
/srv/node (as can be seen in the above example of mounting /dev/sda1 as
/srv/node/sda). If you choose to mount the drives in another directory,
be sure to set the `devices` config option in all of the server configs to
point to the correct directory.
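
To make the mounts persistent across reboots, an /etc/fstab entry along these
lines (device name hypothetical) mirrors the mount options above::

    /dev/sda1  /srv/node/sda  xfs  noatime,nodiratime,nobarrier,logbufs=8  0 0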

---------------------
General System Tuning
---------------------

Rackspace currently runs Swift on Ubuntu Server 10.04, and the following
changes have been found to be useful for our use cases.

The following settings should be in `/etc/sysctl.conf`::

    # disable TIME_WAIT.. wait..
    net.ipv4.tcp_tw_recycle=1
    net.ipv4.tcp_tw_reuse=1

    # disable syn cookies
    net.ipv4.tcp_syncookies = 0

    # double amount of allowed conntrack
    net.ipv4.netfilter.ip_conntrack_max = 262144

To load the updated sysctl settings, run ``sudo sysctl -p``.

A note about changing the TIME_WAIT values: by default the OS will hold
a port open for 60 seconds to ensure that any remaining packets can be
received. During high usage, and with the number of connections that are
created, it is easy to run out of ports. We can change this since we are
in control of the network. If you are not in control of the network, or
do not expect high loads, then you may not want to adjust those values.

----------------------
Logging Considerations
----------------------

Swift is set up to log directly to syslog. Every service can be configured
with the `log_facility` option to set the syslog log facility destination.
It is recommended to use syslog-ng to route the logs to specific log
files locally on the server and also to remote log collecting servers.
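
As a rough sketch only (the source name s_src and the file location vary by
distribution and are assumptions here), a syslog-ng rule routing the LOG_LOCAL0
facility used above to its own file could look like::

    # route Swift's LOCAL0 facility to a dedicated log file
    filter f_local0 { facility(local0); };
    destination d_swift { file("/var/log/swift/storage.log"); };
    log { source(s_src); filter(f_local0); destination(d_swift); };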

@@ -33,6 +33,13 @@ Development:
     development_guidelines
     development_saio
 
+Deployment:
+
+.. toctree::
+    :maxdepth: 1
+
+    deployment_guide
+
 Source:
 
 .. toctree::

@@ -2,9 +2,13 @@
 The Auth System
 ===============
 
-The auth system for Swift is based on the auth system from an existing
-architecture -- actually from a few existing auth systems -- and is therefore a
-bit disjointed. The distilled points about it are:
+--------------
+Developer Auth
+--------------
+
+The auth system for Swift is based on the auth system from the existing
+Rackspace architecture -- actually from a few existing auth systems --
+and is therefore a bit disjointed. The distilled points about it are:
 
 * The authentication/authorization part is outside Swift itself
 * The user of Swift passes in an auth token with each request

@@ -19,29 +23,29 @@ of something unique, some use "something else" but the salient point is that
 the token is a string which can be sent as-is back to the auth system for
 validation.
 
-The validation call is, for historical reasons, an XMLRPC call. There are two
-types of auth systems, type 0 and type 1. With type 0, the XMLRPC call is given
-the token and the Swift account name (also known as the account hash because
-it's usually of the format <reseller>_<hash>). With type 1, the call is given
-the container name and HTTP method as well as the token and account hash. Both
-types are also given a service login and password recorded in Swift's
-resellers.conf. For a valid token, both auth system types respond with a
-session TTL and overall expiration in seconds from now. Swift does not honor
-the session TTL but will cache the token up to the expiration time. Tokens can
-be purged through a call to Swift's services server.
+An auth call is given the auth token and the Swift account hash. For a valid
+token, the auth system responds with a session TTL and overall expiration in
+seconds from now. Swift does not honor the session TTL but will cache the
+token up to the expiration time. Tokens can be purged through a call to the
+auth system.
 
 How the user gets the token to use with Swift is up to the reseller software
 itself. For instance, with Cloud Files the user has a starting URL to an auth
-system. The user starts a session by sending a ReST request to that auth system
-to receive the auth token, a URL to the Swift system, and a URL to the CDN
-system.
+system. The user starts a session by sending a ReST request to that auth system
+to receive the auth token and a URL to the Swift system.
+
+--------------
+Extending Auth
+--------------
+
+Auth is written as wsgi middleware, so implementing your own auth is as easy
+as writing new wsgi middleware, and plugging it in to the proxy server.
+
+The current middleware is implemented in the DevAuthMiddleware class in
+swift/common/auth.py, and should be a good starting place for implementing
+your own auth.
 
 ------------------
 History and Future
 ------------------
 
 What's established in Swift for authentication/authorization has history from
-before Swift, so that won't be recorded here. It was minimally integrated with
-Swift to meet project deadlines, but in the near future Swift should have a
-pluggable auth/reseller system to support the above as well as other
-architectures.
+before Swift, so that won't be recorded here.
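
The "Extending Auth" section added above points at DevAuthMiddleware as the
reference. As a minimal, hypothetical sketch of the wsgi middleware shape it
describes (this is not the real DevAuthMiddleware; the token check is a stub
you would replace with a call to your auth system)::

    class StubAuthMiddleware(object):
        """Hypothetical example of auth as wsgi middleware."""

        def __init__(self, app):
            self.app = app

        def __call__(self, env, start_response):
            # Swift clients pass the auth token in the X-Auth-Token header
            token = env.get('HTTP_X_AUTH_TOKEN')
            if not self.valid_token(token):
                start_response('401 Unauthorized', [('Content-Length', '0')])
                return []
            # token accepted; hand the request on to the proxy app
            return self.app(env, start_response)

        def valid_token(self, token):
            # stub: a real implementation would validate the token against
            # the auth system and cache it up to its expiration time
            return token is not None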