Foundational support for PUT and GET of erasure-coded objects

This commit makes it possible to PUT an object into Swift and have it
stored using erasure coding instead of replication, and also to GET
the object back from Swift at a later time.

This works by splitting the incoming object into a number of segments,
erasure-coding each segment in turn to get fragments, then
concatenating the fragments into fragment archives. Segments are 1 MiB
in size, except the last, which is between 1 B and 1 MiB.

+====================================================================+
|                             object data                            |
+====================================================================+

                                   |
          +------------------------+----------------------+
          |                        |                      |
          v                        v                      v

+===================+    +===================+         +==============+
|     segment 1     |    |     segment 2     |   ...   |   segment N  |
+===================+    +===================+         +==============+

          |                       |
          |                       |
          v                       v

     /=========\             /=========\
     | pyeclib |             | pyeclib |         ...
     \=========/             \=========/

          |                       |
          |                       |
          +--> fragment A-1       +--> fragment A-2
          |                       |
          |                       |
          |                       |
          |                       |
          |                       |
          +--> fragment B-1       +--> fragment B-2
          |                       |
          |                       |
         ...                     ...

Then, object server A gets the concatenation of fragment A-1, A-2,
..., A-N, so its .data file looks like this (called a "fragment
archive"):

+=====================================================================+
|     fragment A-1     |     fragment A-2     |  ...  |  fragment A-N |
+=====================================================================+

Since this means that the object server never sees the object data as
the client sent it, we have to do a few things to ensure data
integrity.

First, the proxy has to check the Etag if the client provided it; the
object server can't do it since the object server doesn't see the raw
data.

Second, if the client does not provide an Etag, the proxy computes it
and uses the MIME-PUT mechanism to provide it to the object servers
after the object body. Otherwise, the object would not have an Etag at
all.

Third, the proxy computes the MD5 of each fragment archive and sends
it to the object server using the MIME-PUT mechanism. With replicated
objects, the proxy checks that the Etags from all the object servers
match, and if they don't, returns a 500 to the client. This mitigates
the risk of data corruption in one of the proxy --> object connections,
and signals to the client when it happens. With EC objects, we can't
use that same mechanism, so we must send the checksum with each
fragment archive to get comparable protection.

On the GET path, the inverse happens: the proxy connects to a bunch of
object servers (M of them, for an M+K scheme), reads one fragment at a
time from each fragment archive, decodes those fragments into a
segment, and serves the segment to the client.

When an object server dies partway through a GET response, any
partially-fetched fragment is discarded, the resumption point is wound
back to the nearest fragment boundary, and the GET is retried with the
next object server.

GET requests for a single byterange work; GET requests for multiple
byteranges do not.

There are a number of things _not_ included in this commit. Some of
them are listed here:

 * multi-range GET

 * deferred cleanup of old .data files

 * durability (daemon to reconstruct missing archives)

Co-Authored-By: Alistair Coles <alistair.coles@hp.com>
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: John Dickinson <me@not.mn>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Tushar Gohad <tushar.gohad@intel.com>
Co-Authored-By: Paul Luse <paul.e.luse@intel.com>
Co-Authored-By: Christian Schwede <christian.schwede@enovance.com>
Co-Authored-By: Yuan Zhou <yuan.zhou@intel.com>
Change-Id: I9c13c03616489f8eab7dcd7c5f21237ed4cb6fd2
2014-10-22 13:18:34 -07:00 · 2014-10-22 13:18:34 -07:00 · decbcd24d4
commit decbcd24d4
parent b1eda4aef8
19 changed files with 3882 additions and 557 deletions
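
The segment, fragment, and fragment-archive layout described in the commit message can be sketched in a few lines. This is a toy illustration only, not the code in this commit: `toy_encode` is a hypothetical stand-in for pyeclib (it slices a segment into `NDATA` pieces and appends one XOR parity piece), and the segment size is shrunk from 1 MiB to 8 bytes so the result is easy to inspect.

```python
import hashlib
from functools import reduce

SEGMENT_SIZE = 8   # the commit uses 1 MiB; tiny here for readability
NDATA = 2          # hypothetical scheme: 2 data fragments + 1 XOR parity


def toy_encode(segment, ndata=NDATA):
    """Hypothetical stand-in for pyeclib: ndata slices plus XOR parity."""
    slice_len = -(-len(segment) // ndata)            # ceiling division
    padded = segment.ljust(slice_len * ndata, b"\0")
    slices = [padded[i * slice_len:(i + 1) * slice_len]
              for i in range(ndata)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*slices))
    return slices + [parity]


def build_fragment_archives(body):
    """Segment the body, encode each segment, and append fragment i of
    every segment to archive i (one archive per object server)."""
    archives = [b""] * (NDATA + 1)
    for start in range(0, len(body), SEGMENT_SIZE):
        fragments = toy_encode(body[start:start + SEGMENT_SIZE])
        archives = [a + f for a, f in zip(archives, fragments)]
    return archives


body = b"0123456789abcdefXYZ"      # two full segments and a short last one
archives = build_fragment_archives(body)

# The proxy is the only place that sees the raw body, so it computes the
# object Etag; each archive also gets its own MD5 for the object server.
object_etag = hashlib.md5(body).hexdigest()
archive_md5s = [hashlib.md5(a).hexdigest() for a in archives]
```

No single archive contains the client's bytes in order, which is why the object server cannot verify the client's Etag itself.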
--- a/swift/common/exceptions.py
+++ b/swift/common/exceptions.py
@@ -31,10 +31,28 @@ class SwiftException(Exception):
    pass


+class PutterConnectError(Exception):
+
+    def __init__(self, status=None):
+        self.status = status
+
+
 class InvalidTimestamp(SwiftException):
    pass


+class InsufficientStorage(SwiftException):
+    pass
+
+
+class FooterNotSupported(SwiftException):
+    pass
+
+
+class MultiphasePUTNotSupported(SwiftException):
+    pass
+
+
 class DiskFileError(SwiftException):
    pass

@@ -103,6 +121,10 @@ class ConnectionTimeout(Timeout):
    pass


+class ResponseTimeout(Timeout):
+    pass
+
+
 class DriveNotMounted(SwiftException):
    pass

--- a/swift/common/middleware/formpost.py
+++ b/swift/common/middleware/formpost.py
@@ -218,7 +218,14 @@ class FormPost(object):
                        env, attrs['boundary'])
                    start_response(status, headers)
                    return [body]
-            except (FormInvalid, MimeInvalid, EOFError) as err:
+            except MimeInvalid:
+                body = 'FormPost: invalid starting boundary'
+                start_response(
+                    '400 Bad Request',
+                    (('Content-Type', 'text/plain'),
+                     ('Content-Length', str(len(body)))))
+                return [body]
+            except (FormInvalid, EOFError) as err:
                body = 'FormPost: %s' % err
                start_response(
                    '400 Bad Request',
--- a/swift/common/ring/ring.py
+++ b/swift/common/ring/ring.py
@@ -243,7 +243,7 @@ class Ring(object):
                if dev_id not in seen_ids:
                    part_nodes.append(self.devs[dev_id])
                    seen_ids.add(dev_id)
-        return part_nodes
+        return [dict(node, index=i) for i, node in enumerate(part_nodes)]

    def get_part(self, account, container=None, obj=None):
        """
@@ -291,6 +291,7 @@ class Ring(object):

        ======  ===============================================================
        id      unique integer identifier amongst devices
+        index   offset into the primary node list for the partition
        weight  a float of the relative weight of this device as compared to
                others; this indicates how many partitions the builder will try
                to assign to this device
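
The effect of the new return statement in `_get_part_nodes` can be seen with plain dictionaries. The device entries below are hypothetical and carry far fewer keys than real ring devices; only the list comprehension is taken from the hunk above.

```python
# Hypothetical device table; real ring devices have more fields.
devs = [
    {"id": 0, "ip": "10.0.0.1", "device": "sda"},
    {"id": 1, "ip": "10.0.0.2", "device": "sdb"},
    {"id": 2, "ip": "10.0.0.3", "device": "sdc"},
]

# Primary nodes for one partition, in ring order.
part_nodes = [devs[2], devs[0], devs[1]]

# Same expression as the new return statement: each node is copied and
# tagged with its position in the primary list.
nodes = [dict(node, index=i) for i, node in enumerate(part_nodes)]

for node in nodes:
    print(node["index"], node["id"], node["device"])
```

Because `dict(node, index=i)` builds a shallow copy, the shared device table is not mutated, and the same device can carry a different `index` in another partition.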
--- a/swift/common/storage_policy.py
+++ b/swift/common/storage_policy.py
@@ -356,6 +356,36 @@ class ECStoragePolicy(BaseStoragePolicy):
    def ec_segment_size(self):
        return self._ec_segment_size

+    @property
+    def fragment_size(self):
+        """
+        Maximum length of a fragment, including header.
+
+        NB: a fragment archive is a sequence of 0 or more max-length
+        fragments followed by one possibly-shorter fragment.
+        """
+        # Technically pyeclib's get_segment_info signature calls for
+        # (data_len, segment_size) but on a ranged GET we don't know the
+        # ec-content-length header before we need to compute where in the
+        # object we should request to align with the fragment size.  So we
+        # tell pyeclib a lie - from its perspective, as long as data_len >=
+        # segment_size it'll give us the answer we want.  From our
+        # perspective, because we only use this answer to calculate the
+        # *minimum* size we should read from an object body even if data_len <
+        # segment_size we'll still only read *the whole one and only last
+        # fragment* and pass that into pyeclib who will know what to do with
+        # it just as it always does when the last fragment is < fragment_size.
+        return self.pyeclib_driver.get_segment_info(
+            self.ec_segment_size, self.ec_segment_size)['fragment_size']
+
+    @property
+    def ec_scheme_description(self):
+        """
+        This shorthand form of the important parts of the ec schema is stored
+        in Object System Metadata on the EC Fragment Archives for debugging.
+        """
+        return "%s %d+%d" % (self._ec_type, self._ec_ndata, self._ec_nparity)
+
    def __repr__(self):
        return ("%s, EC config(ec_type=%s, ec_segment_size=%d, "
                "ec_ndata=%d, ec_nparity=%d)") % (
--- a/swift/common/utils.py
+++ b/swift/common/utils.py
@@ -2236,11 +2236,16 @@ class GreenAsyncPile(object):

    Correlating results with jobs (if necessary) is left to the caller.
    """
-    def __init__(self, size):
+    def __init__(self, size_or_pool):
        """
-        :param size: size pool of green threads to use
+        :param size_or_pool: thread pool size or a pool to use
        """
-        self._pool = GreenPool(size)
+        if isinstance(size_or_pool, GreenPool):
+            self._pool = size_or_pool
+            size = self._pool.size
+        else:
+            self._pool = GreenPool(size_or_pool)
+            size = size_or_pool
        self._responses = eventlet.queue.LightQueue(size)
        self._inflight = 0

@@ -2646,6 +2651,10 @@ def public(func):

 def quorum_size(n):
    """
+    quorum size as it applies to services that use 'replication' for data
+    integrity  (Account/Container services).  Object quorum_size is defined
+    on a storage policy basis.
+
    Number of successful backend requests needed for the proxy to consider
    the client request successful.
    """
@@ -3139,6 +3148,26 @@ _rfc_extension_pattern = re.compile(
    r'(?:\s*;\s*(' + _rfc_token + r")\s*(?:=\s*(" + _rfc_token +
    r'|"(?:[^"\\]|\\.)*"))?)')

+_content_range_pattern = re.compile(r'^bytes (\d+)-(\d+)/(\d+)$')
+
+
+def parse_content_range(content_range):
+    """
+    Parse a content-range header into (first_byte, last_byte, total_size).
+
+    See RFC 7233 section 4.2 for details on the header format, but it's
+    basically "Content-Range: bytes ${start}-${end}/${total}".
+
+    :param content_range: Content-Range header value to parse,
+        e.g. "bytes 100-1249/49004"
+    :returns: 3-tuple (start, end, total)
+    :raises: ValueError if malformed
+    """
+    found = re.search(_content_range_pattern, content_range)
+    if not found:
+        raise ValueError("malformed Content-Range %r" % (content_range,))
+    return tuple(int(x) for x in found.groups())
+

 def parse_content_type(content_type):
    """
@@ -3293,8 +3322,11 @@ def iter_multipart_mime_documents(wsgi_input, boundary, read_chunk_size=4096):
    :raises: MimeInvalid if the document is malformed
    """
    boundary = '--' + boundary
-    if wsgi_input.readline(len(boundary + '\r\n')).strip() != boundary:
-        raise swift.common.exceptions.MimeInvalid('invalid starting boundary')
+    blen = len(boundary) + 2  # \r\n
+    got = wsgi_input.readline(blen)
+    if got.strip() != boundary:
+        raise swift.common.exceptions.MimeInvalid(
+            'invalid starting boundary: wanted %r, got %r' % (boundary, got))
    boundary = '\r\n' + boundary
    input_buffer = ''
    done = False
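
A standalone copy of the `parse_content_range` helper added to this file shows what it accepts and rejects. The function body mirrors the hunk above; the sample header values are illustrative.

```python
import re

_content_range_pattern = re.compile(r'^bytes (\d+)-(\d+)/(\d+)$')


def parse_content_range(content_range):
    """Return (first_byte, last_byte, total_size) or raise ValueError."""
    found = _content_range_pattern.search(content_range)
    if not found:
        raise ValueError("malformed Content-Range %r" % (content_range,))
    return tuple(int(x) for x in found.groups())


start, end, total = parse_content_range("bytes 100-1249/49004")
length = end - start + 1   # both offsets are inclusive

# Forms that RFC 7233 permits but this pattern does not match, such as an
# unknown total ("bytes 0-9/*") or an unsatisfied range ("bytes */49004"),
# raise ValueError.
try:
    parse_content_range("bytes */49004")
    rejected = False
except ValueError:
    rejected = True
```

Callers that may receive the `*` forms need to be prepared for the `ValueError`.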
--- a/swift/obj/diskfile.py
+++ b/swift/obj/diskfile.py
@@ -530,6 +530,11 @@ class DiskFileRouter(object):
        their DiskFile implementation.
        """
        def register_wrapper(diskfile_cls):
+            if policy_type in cls.policy_type_to_manager_cls:
+                raise PolicyError(
+                    '%r is already registered for the policy_type %r' % (
+                        cls.policy_type_to_manager_cls[policy_type],
+                        policy_type))
            cls.policy_type_to_manager_cls[policy_type] = diskfile_cls
            return diskfile_cls
        return register_wrapper
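
The duplicate-registration guard added above follows a common decorator-registry pattern, sketched here in isolation. `Router`, `ReplicatedManager`, and the local `PolicyError` class are stand-ins invented for this example, not the Swift classes.

```python
class PolicyError(ValueError):
    """Stand-in for Swift's PolicyError in this sketch."""


class Router(object):
    # Class-level registry shared by every use of the decorator.
    policy_type_to_manager_cls = {}

    @classmethod
    def register(cls, policy_type):
        def register_wrapper(manager_cls):
            # Refuse to silently replace an existing registration.
            if policy_type in cls.policy_type_to_manager_cls:
                raise PolicyError(
                    '%r is already registered for the policy_type %r' % (
                        cls.policy_type_to_manager_cls[policy_type],
                        policy_type))
            cls.policy_type_to_manager_cls[policy_type] = manager_cls
            return manager_cls
        return register_wrapper


@Router.register('replication')
class ReplicatedManager(object):
    pass


try:
    @Router.register('replication')
    class AnotherManager(object):
        pass
    duplicate_rejected = False
except PolicyError:
    duplicate_rejected = True
```

Failing at import time makes a conflicting second registration visible immediately, instead of letting whichever module loads last win.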
--- a/swift/proxy/controllers/__init__.py
+++ b/swift/proxy/controllers/__init__.py
@@ -13,7 +13,7 @@

 from swift.proxy.controllers.base import Controller
 from swift.proxy.controllers.info import InfoController
-from swift.proxy.controllers.obj import ObjectController
+from swift.proxy.controllers.obj import ObjectControllerRouter
 from swift.proxy.controllers.account import AccountController
 from swift.proxy.controllers.container import ContainerController

@@ -22,5 +22,5 @@ __all__ = [
    'ContainerController',
    'Controller',
    'InfoController',
-    'ObjectController',
+    'ObjectControllerRouter',
 ]
--- a/swift/proxy/controllers/account.py
+++ b/swift/proxy/controllers/account.py
@@ -58,9 +58,10 @@ class AccountController(Controller):
                         constraints.MAX_ACCOUNT_NAME_LENGTH)
            return resp

-        partition, nodes = self.app.account_ring.get_nodes(self.account_name)
+        partition = self.app.account_ring.get_part(self.account_name)
+        node_iter = self.app.iter_nodes(self.app.account_ring, partition)
        resp = self.GETorHEAD_base(
-            req, _('Account'), self.app.account_ring, partition,
+            req, _('Account'), node_iter, partition,
            req.swift_entity_path.rstrip('/'))
        if resp.status_int == HTTP_NOT_FOUND:
            if resp.headers.get('X-Account-Status', '').lower() == 'deleted':
--- a/swift/proxy/controllers/base.py
+++ b/swift/proxy/controllers/base.py
@@ -28,6 +28,7 @@ import os
 import time
 import functools
 import inspect
+import logging
 import operator
 from sys import exc_info
 from swift import gettext_ as _
@@ -39,14 +40,14 @@ from eventlet.timeout import Timeout
 from swift.common.wsgi import make_pre_authed_env
 from swift.common.utils import Timestamp, config_true_value, \
    public, split_path, list_from_csv, GreenthreadSafeIterator, \
-    quorum_size, GreenAsyncPile
+    GreenAsyncPile, quorum_size, parse_content_range
 from swift.common.bufferedhttp import http_connect
 from swift.common.exceptions import ChunkReadTimeout, ChunkWriteTimeout, \
    ConnectionTimeout
 from swift.common.http import is_informational, is_success, is_redirection, \
    is_server_error, HTTP_OK, HTTP_PARTIAL_CONTENT, HTTP_MULTIPLE_CHOICES, \
    HTTP_BAD_REQUEST, HTTP_NOT_FOUND, HTTP_SERVICE_UNAVAILABLE, \
-    HTTP_INSUFFICIENT_STORAGE, HTTP_UNAUTHORIZED
+    HTTP_INSUFFICIENT_STORAGE, HTTP_UNAUTHORIZED, HTTP_CONTINUE
 from swift.common.swob import Request, Response, HeaderKeyDict, Range, \
    HTTPException, HTTPRequestedRangeNotSatisfiable
 from swift.common.request_helpers import strip_sys_meta_prefix, \
@@ -593,16 +594,37 @@ def close_swift_conn(src):
        pass


+def bytes_to_skip(record_size, range_start):
+    """
+    Assume an object is composed of N records, where the first N-1 are all
+    the same size and the last is at most that large, but may be smaller.
+
+    When a range request is made, it might start with a partial record. This
+    must be discarded, lest the consumer get bad data. This is particularly
+    true of suffix-byte-range requests, e.g. "Range: bytes=-12345" where the
+    size of the object is unknown at the time the request is made.
+
+    This function computes the number of bytes that must be discarded to
+    ensure only whole records are yielded. Erasure-code decoding needs this.
+
+    This function could have been inlined, but it took enough tries to get
+    right that some targeted unit tests were desirable, hence its extraction.
+    """
+    return (record_size - (range_start % record_size)) % record_size
+
+
 class GetOrHeadHandler(object):

-    def __init__(self, app, req, server_type, ring, partition, path,
-                 backend_headers):
+    def __init__(self, app, req, server_type, node_iter, partition, path,
+                 backend_headers, client_chunk_size=None):
        self.app = app
-        self.ring = ring
+        self.node_iter = node_iter
        self.server_type = server_type
        self.partition = partition
        self.path = path
        self.backend_headers = backend_headers
+        self.client_chunk_size = client_chunk_size
+        self.skip_bytes = 0
        self.used_nodes = []
        self.used_source_etag = ''
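
A few worked values make the `bytes_to_skip` arithmetic above concrete. The formula is copied from the hunk; the record size of 100 bytes is a hypothetical value chosen so the results can be checked by eye.

```python
def bytes_to_skip(record_size, range_start):
    return (record_size - (range_start % record_size)) % record_size


RECORD_SIZE = 100  # hypothetical; stands in for the fragment size

results = {}
for range_start in (0, 1, 150, 299, 300):
    skip = bytes_to_skip(RECORD_SIZE, range_start)
    # Offset of the first whole record at or after range_start.
    results[range_start] = (skip, range_start + skip)
    print(range_start, skip, range_start + skip)
```

A start that already sits on a record boundary (0 or 300) skips nothing, thanks to the outer `% record_size`; a start at 150 discards the remaining 50 bytes of the partial record so that reading resumes at offset 200.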

@@ -649,6 +671,35 @@ class GetOrHeadHandler(object):
        else:
            self.backend_headers['Range'] = 'bytes=%d-' % num_bytes

+    def learn_size_from_content_range(self, start, end):
+        """
+        If client_chunk_size is set, makes sure we yield things starting on
+        chunk boundaries based on the Content-Range header in the response.
+
+        Sets our first Range header to the value learned from the
+        Content-Range header in the response; if we were given a
+        fully-specified range (e.g. "bytes=123-456"), this is a no-op.
+
+        If we were given a half-specified range (e.g. "bytes=123-" or
+        "bytes=-456"), then this changes the Range header to a
+        semantically-equivalent one *and* it lets us resume on a proper
+        boundary instead of just in the middle of a piece somewhere.
+
+        If the original request is for more than one range, this does not
+        affect our backend Range header, since we don't support resuming one
+        of those anyway.
+        """
+        if self.client_chunk_size:
+            self.skip_bytes = bytes_to_skip(self.client_chunk_size, start)
+
+        if 'Range' in self.backend_headers:
+            req_range = Range(self.backend_headers['Range'])
+
+            if len(req_range.ranges) > 1:
+                return
+
+            self.backend_headers['Range'] = "bytes=%d-%d" % (start, end)
+
    def is_good_source(self, src):
        """
        Indicates whether or not the request made to the backend found
@@ -674,42 +725,74 @@ class GetOrHeadHandler(object):
        """
        try:
            nchunks = 0
-            bytes_read_from_source = 0
+            client_chunk_size = self.client_chunk_size
+            bytes_consumed_from_backend = 0
            node_timeout = self.app.node_timeout
            if self.server_type == 'Object':
                node_timeout = self.app.recoverable_node_timeout
+            buf = ''
            while True:
                try:
                    with ChunkReadTimeout(node_timeout):
                        chunk = source.read(self.app.object_chunk_size)
                        nchunks += 1
-                        bytes_read_from_source += len(chunk)
+                        buf += chunk
                except ChunkReadTimeout:
                    exc_type, exc_value, exc_traceback = exc_info()
                    if self.newest or self.server_type != 'Object':
                        raise exc_type, exc_value, exc_traceback
                    try:
-                        self.fast_forward(bytes_read_from_source)
+                        self.fast_forward(bytes_consumed_from_backend)
                    except (NotImplementedError, HTTPException, ValueError):
                        raise exc_type, exc_value, exc_traceback
+                    buf = ''
                    new_source, new_node = self._get_source_and_node()
                    if new_source:
                        self.app.exception_occurred(
                            node, _('Object'),
-                            _('Trying to read during GET (retrying)'))
+                            _('Trying to read during GET (retrying)'),
+                            level=logging.ERROR, exc_info=(
+                                exc_type, exc_value, exc_traceback))
                        # Close-out the connection as best as possible.
                        if getattr(source, 'swift_conn', None):
                            close_swift_conn(source)
                        source = new_source
                        node = new_node
-                        bytes_read_from_source = 0
                        continue
                    else:
                        raise exc_type, exc_value, exc_traceback
+
+                if buf and self.skip_bytes:
+                    if self.skip_bytes < len(buf):
+                        buf = buf[self.skip_bytes:]
+                        bytes_consumed_from_backend += self.skip_bytes
+                        self.skip_bytes = 0
+                    else:
+                        self.skip_bytes -= len(buf)
+                        bytes_consumed_from_backend += len(buf)
+                        buf = ''
+
                if not chunk:
+                    if buf:
+                        with ChunkWriteTimeout(self.app.client_timeout):
+                            bytes_consumed_from_backend += len(buf)
+                            yield buf
+                        buf = ''
                    break
-                with ChunkWriteTimeout(self.app.client_timeout):
-                    yield chunk
+
+                if client_chunk_size is not None:
+                    while len(buf) >= client_chunk_size:
+                        client_chunk = buf[:client_chunk_size]
+                        buf = buf[client_chunk_size:]
+                        with ChunkWriteTimeout(self.app.client_timeout):
+                            yield client_chunk
+                        bytes_consumed_from_backend += len(client_chunk)
+                else:
+                    with ChunkWriteTimeout(self.app.client_timeout):
+                        yield buf
+                    bytes_consumed_from_backend += len(buf)
+                    buf = ''
+
                # This is for fairness; if the network is outpacing the CPU,
                # we'll always be able to read and write data without
                # encountering an EWOULDBLOCK, and so eventlet will not switch
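
The buffering added to `_make_app_iter` in this hunk does two things: it drops `skip_bytes` from the front of the stream, then re-slices whatever the backend returns into pieces of exactly `client_chunk_size`. A simplified sketch of that idea follows. It is an assumption-laden reduction, with `rechunk` as a hypothetical name and with the timeouts, retry logic, and `bytes_consumed_from_backend` bookkeeping of the real method left out.

```python
def rechunk(reads, client_chunk_size, skip_bytes=0):
    """Yield fixed-size chunks from an iterable of arbitrary-size reads.

    The first skip_bytes bytes are discarded; the final chunk may be
    shorter than client_chunk_size.
    """
    buf = b""
    for chunk in reads:
        buf += chunk
        if skip_bytes:
            # Discard the leading partial record, possibly across reads.
            dropped = min(skip_bytes, len(buf))
            buf = buf[dropped:]
            skip_bytes -= dropped
        while len(buf) >= client_chunk_size:
            yield buf[:client_chunk_size]
            buf = buf[client_chunk_size:]
    if buf:
        yield buf   # short tail, like the last fragment of an archive


# Backend reads do not line up with the 4-byte records the consumer wants.
backend_reads = [b"abc", b"defgh", b"ij", b"k"]
chunks = list(rechunk(backend_reads, client_chunk_size=4, skip_bytes=2))
```

Every chunk except possibly the last is a whole record, which is what a per-fragment erasure-code decoder needs.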
@@ -757,7 +840,7 @@ class GetOrHeadHandler(object):
        node_timeout = self.app.node_timeout
        if self.server_type == 'Object' and not self.newest:
            node_timeout = self.app.recoverable_node_timeout
-        for node in self.app.iter_nodes(self.ring, self.partition):
+        for node in self.node_iter:
            if node in self.used_nodes:
                continue
            start_node_timing = time.time()
@@ -793,8 +876,10 @@ class GetOrHeadHandler(object):
                        src_headers = dict(
                            (k.lower(), v) for k, v in
                            possible_source.getheaders())
-                        if src_headers.get('etag', '').strip('"') != \
-                                self.used_source_etag:
+
+                        if self.used_source_etag != src_headers.get(
+                                'x-object-sysmeta-ec-etag',
+                                src_headers.get('etag', '')).strip('"'):
                            self.statuses.append(HTTP_NOT_FOUND)
                            self.reasons.append('')
                            self.bodies.append('')
@@ -832,7 +917,9 @@ class GetOrHeadHandler(object):
            src_headers = dict(
                (k.lower(), v) for k, v in
                possible_source.getheaders())
-            self.used_source_etag = src_headers.get('etag', '').strip('"')
+            self.used_source_etag = src_headers.get(
+                'x-object-sysmeta-ec-etag',
+                src_headers.get('etag', '')).strip('"')
            return source, node
        return None, None
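
The two hunks above change which header identifies the object when the proxy switches to another backend mid-GET. The lookup can be isolated as below; `source_etag` is a hypothetical helper name and the header values are made up, but the nested `get` is the expression used in the diff.

```python
def source_etag(src_headers):
    """Prefer the EC sysmeta Etag, fall back to the plain Etag."""
    return src_headers.get(
        'x-object-sysmeta-ec-etag',
        src_headers.get('etag', '')).strip('"')


# EC backend: 'etag' describes the fragment archive, which differs per
# node, while the sysmeta header carries the Etag of the whole object.
ec_node_a = {'etag': '"frag-archive-a"', 'x-object-sysmeta-ec-etag': 'obj-1'}
ec_node_b = {'etag': '"frag-archive-b"', 'x-object-sysmeta-ec-etag': 'obj-1'}

# Replicated backend: only the plain Etag is present.
replica = {'etag': '"obj-2"'}

same_object = source_etag(ec_node_a) == source_etag(ec_node_b)
```

Comparing only the plain `etag` would make every EC node look like a different object, since each holds a different fragment archive.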

@@ -841,13 +928,17 @@ class GetOrHeadHandler(object):
        res = None
        if source:
            res = Response(request=req)
+            res.status = source.status
+            update_headers(res, source.getheaders())
            if req.method == 'GET' and \
                    source.status in (HTTP_OK, HTTP_PARTIAL_CONTENT):
+                cr = res.headers.get('Content-Range')
+                if cr:
+                    start, end, total = parse_content_range(cr)
+                    self.learn_size_from_content_range(start, end)
                res.app_iter = self._make_app_iter(req, node, source)
                # See NOTE: swift_conn at top of file about this.
                res.swift_conn = source.swift_conn
-            res.status = source.status
-            update_headers(res, source.getheaders())
            if not res.environ:
                res.environ = {}
            res.environ['swift_x_timestamp'] = \
@@ -993,7 +1084,8 @@ class Controller(object):
        else:
            info['partition'] = part
            info['nodes'] = nodes
-            info.setdefault('storage_policy', '0')
+        if info.get('storage_policy') is None:
+            info['storage_policy'] = 0
        return info

    def _make_request(self, nodes, part, method, path, headers, query,
@@ -1098,6 +1190,13 @@ class Controller(object):
                                  '%s %s' % (self.server_type, req.method),
                                  overrides=overrides, headers=resp_headers)

+    def _quorum_size(self, n):
+        """
+        Number of successful backend responses needed for the proxy to
+        consider the client request successful.
+        """
+        return quorum_size(n)
+
    def have_quorum(self, statuses, node_count):
        """
        Given a list of statuses from several requests, determine if
@@ -1107,16 +1206,18 @@ class Controller(object):
        :param node_count: number of nodes being queried (basically ring count)
        :returns: True or False, depending on if quorum is established
        """
-        quorum = quorum_size(node_count)
+        quorum = self._quorum_size(node_count)
        if len(statuses) >= quorum:
-            for hundred in (HTTP_OK, HTTP_MULTIPLE_CHOICES, HTTP_BAD_REQUEST):
+            for hundred in (HTTP_CONTINUE, HTTP_OK, HTTP_MULTIPLE_CHOICES,
+                            HTTP_BAD_REQUEST):
                if sum(1 for s in statuses
                       if hundred <= s < hundred + 100) >= quorum:
                    return True
        return False

    def best_response(self, req, statuses, reasons, bodies, server_type,
-                      etag=None, headers=None, overrides=None):
+                      etag=None, headers=None, overrides=None,
+                      quorum_size=None):
        """
        Given a list of responses from several servers, choose the best to
        return to the API.
@@ -1128,10 +1229,16 @@ class Controller(object):
        :param server_type: type of server the responses came from
        :param etag: etag
        :param headers: headers of each response
+        :param overrides: overrides to apply when lacking quorum
+        :param quorum_size: quorum size to use
        :returns: swob.Response object with the correct status, body, etc. set
        """
+        if quorum_size is None:
+            quorum_size = self._quorum_size(len(statuses))
+
        resp = self._compute_quorum_response(
-            req, statuses, reasons, bodies, etag, headers)
+            req, statuses, reasons, bodies, etag, headers,
+            quorum_size=quorum_size)
        if overrides and not resp:
            faked_up_status_indices = set()
            transformed = []
@@ -1145,7 +1252,8 @@ class Controller(object):
            statuses, reasons, headers, bodies = zip(*transformed)
            resp = self._compute_quorum_response(
                req, statuses, reasons, bodies, etag, headers,
-                indices_to_avoid=faked_up_status_indices)
+                indices_to_avoid=faked_up_status_indices,
+                quorum_size=quorum_size)

        if not resp:
            resp = Response(request=req)
@@ -1156,14 +1264,14 @@ class Controller(object):
        return resp

    def _compute_quorum_response(self, req, statuses, reasons, bodies, etag,
-                                 headers, indices_to_avoid=()):
+                                 headers, quorum_size, indices_to_avoid=()):
        if not statuses:
            return None
        for hundred in (HTTP_OK, HTTP_MULTIPLE_CHOICES, HTTP_BAD_REQUEST):
            hstatuses = \
                [(i, s) for i, s in enumerate(statuses)
                 if hundred <= s < hundred + 100]
-            if len(hstatuses) >= quorum_size(len(statuses)):
+            if len(hstatuses) >= quorum_size:
                resp = Response(request=req)
                try:
                    status_index, status = max(
@@ -1228,22 +1336,25 @@ class Controller(object):
        else:
            self.app.logger.warning('Could not autocreate account %r' % path)

-    def GETorHEAD_base(self, req, server_type, ring, partition, path):
+    def GETorHEAD_base(self, req, server_type, node_iter, partition, path,
+                       client_chunk_size=None):
        """
        Base handler for HTTP GET or HEAD requests.

        :param req: swob.Request object
        :param server_type: server type used in logging
-        :param ring: the ring to obtain nodes from
+        :param node_iter: an iterator to obtain nodes from
        :param partition: partition
        :param path: path for the request
+        :param client_chunk_size: chunk size for response body iterator
        :returns: swob.Response object
        """
        backend_headers = self.generate_request_headers(
            req, additional=req.headers)

-        handler = GetOrHeadHandler(self.app, req, self.server_type, ring,
-                                   partition, path, backend_headers)
+        handler = GetOrHeadHandler(self.app, req, self.server_type, node_iter,
+                                   partition, path, backend_headers,
+                                   client_chunk_size=client_chunk_size)
        res = handler.get_working_response(req)

        if not res:
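
The `have_quorum` change in this file (routing through an overridable `_quorum_size` and adding the 1xx class) can be approximated as a free function. This is a sketch under stated assumptions: `majority` is a hypothetical quorum rule used only for illustration, and the status classes are written as literals rather than the `HTTP_*` constants.

```python
def majority(n):
    """Assumed quorum rule for this sketch: a simple majority."""
    return n // 2 + 1


def have_quorum(statuses, node_count, quorum_size=majority):
    """True if enough responses fall in the same status class."""
    quorum = quorum_size(node_count)
    if len(statuses) < quorum:
        return False
    # 1xx is included so that "100 Continue" responses can form a quorum.
    for hundred in (100, 200, 300, 400):
        same_class = sum(1 for s in statuses
                         if hundred <= s < hundred + 100)
        if same_class >= quorum:
            return True
    return False


checks = [
    have_quorum([201, 201, 503], 3),          # two 2xx out of three
    have_quorum([201, 404, 503], 3),          # no class reaches two
    have_quorum([100, 100, 503], 3),          # 1xx quorum
    # A stricter, hypothetical per-policy rule passed in by the caller.
    have_quorum([201, 201, 201, 503, 503], 5, quorum_size=lambda n: 4),
]
```

Passing the rule in as a parameter plays the role that overriding `_quorum_size` plays in the diff: the counting logic stays the same while the threshold varies.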
--- a/swift/proxy/controllers/container.py
+++ b/swift/proxy/controllers/container.py
@@ -93,8 +93,9 @@ class ContainerController(Controller):
            return HTTPNotFound(request=req)
        part = self.app.container_ring.get_part(
            self.account_name, self.container_name)
+        node_iter = self.app.iter_nodes(self.app.container_ring, part)
        resp = self.GETorHEAD_base(
-            req, _('Container'), self.app.container_ring, part,
+            req, _('Container'), node_iter, part,
            req.swift_entity_path)
        if 'swift.authorize' in req.environ:
            req.acl = resp.headers.get('x-container-read')
--- a/swift/proxy/controllers/obj.py
+++ b/swift/proxy/controllers/obj.py
--- a/swift/proxy/server.py
+++ b/swift/proxy/server.py
@@ -20,6 +20,8 @@ from swift import gettext_ as _
 from random import shuffle
 from time import time
 import itertools
+import functools
+import sys

 from eventlet import Timeout

@@ -32,11 +34,12 @@ from swift.common.utils import cache_from_env, get_logger, \
    affinity_key_function, affinity_locality_predicate, list_from_csv, \
    register_swift_info
 from swift.common.constraints import check_utf8
-from swift.proxy.controllers import AccountController, ObjectController, \
-    ContainerController, InfoController
+from swift.proxy.controllers import AccountController, ContainerController, \
+    ObjectControllerRouter, InfoController
+from swift.proxy.controllers.base import get_container_info
 from swift.common.swob import HTTPBadRequest, HTTPForbidden, \
    HTTPMethodNotAllowed, HTTPNotFound, HTTPPreconditionFailed, \
-    HTTPServerError, HTTPException, Request
+    HTTPServerError, HTTPException, Request, HTTPServiceUnavailable


 # List of entry points for mandatory middlewares.
@@ -109,6 +112,7 @@ class Application(object):
        # ensure rings are loaded for all configured storage policies
        for policy in POLICIES:
            policy.load_ring(swift_dir)
+        self.obj_controller_router = ObjectControllerRouter()
        self.memcache = memcache
        mimetypes.init(mimetypes.knownfiles +
                       [os.path.join(swift_dir, 'mime.types')])
@@ -235,29 +239,44 @@ class Application(object):
        """
        return POLICIES.get_object_ring(policy_idx, self.swift_dir)

-    def get_controller(self, path):
+    def get_controller(self, req):
        """
        Get the controller to handle a request.

-        :param path: path from request
+        :param req: the request
        :returns: tuple of (controller class, path dictionary)

        :raises: ValueError (thrown by split_path) if given invalid path
        """
-        if path == '/info':
+        if req.path == '/info':
            d = dict(version=None,
                     expose_info=self.expose_info,
                     disallowed_sections=self.disallowed_sections,
                     admin_key=self.admin_key)
            return InfoController, d

-        version, account, container, obj = split_path(path, 1, 4, True)
+        version, account, container, obj = split_path(req.path, 1, 4, True)
        d = dict(version=version,
                 account_name=account,
                 container_name=container,
                 object_name=obj)
        if obj and container and account:
-            return ObjectController, d
+            info = get_container_info(req.environ, self)
+            policy_index = req.headers.get('X-Backend-Storage-Policy-Index',
+                                           info['storage_policy'])
+            policy = POLICIES.get_by_index(policy_index)
+            if not policy:
+                # This indicates that a new policy has been created,
+                # with rings, deployed, released (i.e. deprecated =
+                # False), used by a client to create a container via
+                # another proxy that was restarted after the policy
+                # was released, and is now cached - all before this
+                # worker was HUPed to stop accepting new
+                # connections.  There should never be an "unknown"
+                # index - but when there is - it's probably operator
+                # error and hopefully temporary.
+                raise HTTPServiceUnavailable('Unknown Storage Policy')
+            return self.obj_controller_router[policy], d
        elif container and account:
            return ContainerController, d
        elif account and not container and not obj:
@@ -317,7 +336,7 @@ class Application(object):
                    request=req, body='Invalid UTF8 or contains NULL')

            try:
-                controller, path_parts = self.get_controller(req.path)
+                controller, path_parts = self.get_controller(req)
                p = req.path_info
                if isinstance(p, unicode):
                    p = p.encode('utf-8')
@@ -474,9 +493,9 @@ class Application(object):
    def iter_nodes(self, ring, partition, node_iter=None):
        """
        Yields nodes for a ring partition, skipping over error
-        limited nodes and stopping at the configurable number of
-        nodes. If a node yielded subsequently gets error limited, an
-        extra node will be yielded to take its place.
+        limited nodes and stopping at the configurable number of nodes. If a
+        node yielded subsequently gets error limited, an extra node will be
+        yielded to take its place.

        Note that if you're going to iterate over this concurrently from
        multiple greenthreads, you'll want to use a
@@ -527,7 +546,8 @@ class Application(object):
                    if nodes_left <= 0:
                        return

-    def exception_occurred(self, node, typ, additional_info):
+    def exception_occurred(self, node, typ, additional_info,
+                           **kwargs):
        """
        Handle logging of generic exceptions.

@@ -536,11 +556,18 @@ class Application(object):
        :param additional_info: additional information to log
        """
        self._incr_node_errors(node)
-        self.logger.exception(
-            _('ERROR with %(type)s server %(ip)s:%(port)s/%(device)s re: '
-              '%(info)s'),
-            {'type': typ, 'ip': node['ip'], 'port': node['port'],
-             'device': node['device'], 'info': additional_info})
+        if 'level' in kwargs:
+            log = functools.partial(self.logger.log, kwargs.pop('level'))
+            if 'exc_info' not in kwargs:
+                kwargs['exc_info'] = sys.exc_info()
+        else:
+            log = self.logger.exception
+        log(_('ERROR with %(type)s server %(ip)s:%(port)s/%(device)s'
+              ' re: %(info)s'), {
+                  'type': typ, 'ip': node['ip'], 'port':
+                  node['port'], 'device': node['device'],
+                  'info': additional_info
+              }, **kwargs)

    def modify_wsgi_pipeline(self, pipe):
        """
--- a/test/unit/__init__.py
+++ b/test/unit/__init__.py
@@ -67,11 +67,11 @@ def patch_policies(thing_or_policies=None, legacy_only=False,
    elif with_ec_default:
        default_policies = [
            ECStoragePolicy(0, name='ec', is_default=True,
-                            ec_type='jerasure_rs_vand', ec_ndata=4,
-                            ec_nparity=2, ec_segment_size=4096),
+                            ec_type='jerasure_rs_vand', ec_ndata=10,
+                            ec_nparity=4, ec_segment_size=4096),
            StoragePolicy(1, name='unu'),
        ]
-        default_ring_args = [{'replicas': 6}, {}]
+        default_ring_args = [{'replicas': 14}, {}]
    else:
        default_policies = [
            StoragePolicy(0, name='nulo', is_default=True),
@@ -223,7 +223,7 @@ class FakeRing(Ring):
        return self.replicas

    def _get_part_nodes(self, part):
-        return list(self._devs)
+        return [dict(node, index=i) for i, node in enumerate(list(self._devs))]

    def get_more_nodes(self, part):
        # replicas^2 is the true cap
--- a/test/unit/account/test_reaper.py
+++ b/test/unit/account/test_reaper.py
@@ -297,7 +297,8 @@ class TestReaper(unittest.TestCase):
                        'X-Backend-Storage-Policy-Index': policy.idx
                    }
                    ring = r.get_object_ring(policy.idx)
-                    expected = call(ring.devs[i], 0, 'a', 'c', 'o',
+                    expected = call(dict(ring.devs[i], index=i), 0,
+                                    'a', 'c', 'o',
                                    headers=headers, conn_timeout=0.5,
                                    response_timeout=10)
                    self.assertEqual(call_args, expected)
--- a/test/unit/common/ring/test_ring.py
+++ b/test/unit/common/ring/test_ring.py
@@ -363,63 +363,74 @@ class TestRing(TestRingBase):
        self.assertRaises(TypeError, self.ring.get_nodes)
        part, nodes = self.ring.get_nodes('a')
        self.assertEquals(part, 0)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a1')
        self.assertEquals(part, 0)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a4')
        self.assertEquals(part, 1)
-        self.assertEquals(nodes, [self.intended_devs[1],
-                                  self.intended_devs[4]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[1],
+                          self.intended_devs[4]])])

        part, nodes = self.ring.get_nodes('aa')
        self.assertEquals(part, 1)
-        self.assertEquals(nodes, [self.intended_devs[1],
-                                  self.intended_devs[4]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[1],
+                          self.intended_devs[4]])])

        part, nodes = self.ring.get_nodes('a', 'c1')
        self.assertEquals(part, 0)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a', 'c0')
        self.assertEquals(part, 3)
-        self.assertEquals(nodes, [self.intended_devs[1],
-                                  self.intended_devs[4]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[1],
+                          self.intended_devs[4]])])

        part, nodes = self.ring.get_nodes('a', 'c3')
        self.assertEquals(part, 2)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a', 'c2')
-        self.assertEquals(part, 2)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a', 'c', 'o1')
        self.assertEquals(part, 1)
-        self.assertEquals(nodes, [self.intended_devs[1],
-                                  self.intended_devs[4]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[1],
+                          self.intended_devs[4]])])

        part, nodes = self.ring.get_nodes('a', 'c', 'o5')
        self.assertEquals(part, 0)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a', 'c', 'o0')
        self.assertEquals(part, 0)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

        part, nodes = self.ring.get_nodes('a', 'c', 'o2')
        self.assertEquals(part, 2)
-        self.assertEquals(nodes, [self.intended_devs[0],
-                                  self.intended_devs[3]])
+        self.assertEquals(nodes, [dict(node, index=i) for i, node in
+                          enumerate([self.intended_devs[0],
+                          self.intended_devs[3]])])

    def add_dev_to_ring(self, new_dev):
        self.ring.devs.append(new_dev)
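Note on the ring test changes above: every updated assertion expects each returned node to be a copy of the device dict with an added `index` key giving its position in the node list. A minimal sketch of that transformation follows; `with_index` is a hypothetical name, not a function from the ring code, and the device dicts are made-up sample data.

```python
def with_index(nodes):
    """Return copies of the node dicts, each tagged with its list position.

    dict(node, index=i) builds a new dict, so the input dicts are not
    modified.
    """
    return [dict(node, index=i) for i, node in enumerate(nodes)]


# Made-up sample devices, loosely shaped like ring device entries.
sample_devs = [{'id': 0, 'device': 'sda1'}, {'id': 3, 'device': 'sdd1'}]
tagged = with_index(sample_devs)
```

Because the copies are new dicts, the shared device entries stay free of the per-request `index` key, which would plausibly matter when the same device appears at different positions for different partitions.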
--- a/test/unit/common/test_utils.py
+++ b/test/unit/common/test_utils.py
@@ -2190,13 +2190,14 @@ cluster_dfw1 = http://dfw1.host/v1/
        self.assertFalse(utils.streq_const_time('a', 'aaaaa'))
        self.assertFalse(utils.streq_const_time('ABC123', 'abc123'))

-    def test_quorum_size(self):
+    def test_replication_quorum_size(self):
        expected_sizes = {1: 1,
                          2: 2,
                          3: 2,
                          4: 3,
                          5: 3}
-        got_sizes = dict([(n, utils.quorum_size(n)) for n in expected_sizes])
+        got_sizes = dict([(n, utils.quorum_size(n))
+                          for n in expected_sizes])
        self.assertEqual(expected_sizes, got_sizes)

    def test_rsync_ip_ipv4_localhost(self):
@@ -4593,6 +4594,22 @@ class TestLRUCache(unittest.TestCase):
        self.assertEqual(f.size(), 4)


+class TestParseContentRange(unittest.TestCase):
+    def test_good(self):
+        start, end, total = utils.parse_content_range("bytes 100-200/300")
+        self.assertEqual(start, 100)
+        self.assertEqual(end, 200)
+        self.assertEqual(total, 300)
+
+    def test_bad(self):
+        self.assertRaises(ValueError, utils.parse_content_range,
+                          "100-300/500")
+        self.assertRaises(ValueError, utils.parse_content_range,
+                          "bytes 100-200/aardvark")
+        self.assertRaises(ValueError, utils.parse_content_range,
+                          "bytes bulbous-bouffant/4994801")
+
+
 class TestParseContentDisposition(unittest.TestCase):

    def test_basic_content_type(self):
@@ -4622,7 +4639,8 @@ class TestIterMultipartMimeDocuments(unittest.TestCase):
            it.next()
        except MimeInvalid as err:
            exc = err
-        self.assertEquals(str(exc), 'invalid starting boundary')
+        self.assertTrue('invalid starting boundary' in str(exc))
+        self.assertTrue('--unique' in str(exc))

    def test_empty(self):
        it = utils.iter_multipart_mime_documents(StringIO('--unique'),
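Note on TestParseContentRange above: the tests require a `(start, end, total)` tuple of integers for a well-formed `bytes START-END/TOTAL` value and a ValueError for anything else. The sketch below is one parser that satisfies those four cases. It is an assumption about the behaviour, not the implementation in swift.common.utils, and `parse_content_range_sketch` is a hypothetical name.

```python
import re

# "bytes <start>-<end>/<total>", all three fields decimal integers.
_CONTENT_RANGE_RE = re.compile(r'^bytes (\d+)-(\d+)/(\d+)$')


def parse_content_range_sketch(content_range):
    """Parse a Content-Range value into (start, end, total) integers.

    Raises ValueError if the value does not have the expected shape.
    """
    match = _CONTENT_RANGE_RE.match(content_range)
    if not match:
        raise ValueError('malformed Content-Range %r' % (content_range,))
    start, end, total = (int(group) for group in match.groups())
    return start, end, total
```

This sketch rejects forms such as `bytes */300`, which the tests above do not cover.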
--- a/test/unit/proxy/controllers/test_base.py
+++ b/test/unit/proxy/controllers/test_base.py
@@ -21,9 +21,11 @@ from swift.proxy.controllers.base import headers_to_container_info, \
    headers_to_account_info, headers_to_object_info, get_container_info, \
    get_container_memcache_key, get_account_info, get_account_memcache_key, \
    get_object_env_key, get_info, get_object_info, \
-    Controller, GetOrHeadHandler, _set_info_cache, _set_object_info_cache
+    Controller, GetOrHeadHandler, _set_info_cache, _set_object_info_cache, \
+    bytes_to_skip
 from swift.common.swob import Request, HTTPException, HeaderKeyDict, \
    RESPONSE_REASONS
+from swift.common import exceptions
 from swift.common.utils import split_path
 from swift.common.http import is_success
 from swift.common.storage_policy import StoragePolicy
@@ -159,9 +161,11 @@ class TestFuncs(unittest.TestCase):
    def test_GETorHEAD_base(self):
        base = Controller(self.app)
        req = Request.blank('/v1/a/c/o/with/slashes')
+        ring = FakeRing()
+        nodes = list(ring.get_part_nodes(0)) + list(ring.get_more_nodes(0))
        with patch('swift.proxy.controllers.base.'
                   'http_connect', fake_http_connect(200)):
-            resp = base.GETorHEAD_base(req, 'object', FakeRing(), 'part',
+            resp = base.GETorHEAD_base(req, 'object', iter(nodes), 'part',
                                       '/a/c/o/with/slashes')
        self.assertTrue('swift.object/a/c/o/with/slashes' in resp.environ)
        self.assertEqual(
@@ -169,14 +173,14 @@ class TestFuncs(unittest.TestCase):
        req = Request.blank('/v1/a/c/o')
        with patch('swift.proxy.controllers.base.'
                   'http_connect', fake_http_connect(200)):
-            resp = base.GETorHEAD_base(req, 'object', FakeRing(), 'part',
+            resp = base.GETorHEAD_base(req, 'object', iter(nodes), 'part',
                                       '/a/c/o')
        self.assertTrue('swift.object/a/c/o' in resp.environ)
        self.assertEqual(resp.environ['swift.object/a/c/o']['status'], 200)
        req = Request.blank('/v1/a/c')
        with patch('swift.proxy.controllers.base.'
                   'http_connect', fake_http_connect(200)):
-            resp = base.GETorHEAD_base(req, 'container', FakeRing(), 'part',
+            resp = base.GETorHEAD_base(req, 'container', iter(nodes), 'part',
                                       '/a/c')
        self.assertTrue('swift.container/a/c' in resp.environ)
        self.assertEqual(resp.environ['swift.container/a/c']['status'], 200)
@@ -184,7 +188,7 @@ class TestFuncs(unittest.TestCase):
        req = Request.blank('/v1/a')
        with patch('swift.proxy.controllers.base.'
                   'http_connect', fake_http_connect(200)):
-            resp = base.GETorHEAD_base(req, 'account', FakeRing(), 'part',
+            resp = base.GETorHEAD_base(req, 'account', iter(nodes), 'part',
                                       '/a')
        self.assertTrue('swift.account/a' in resp.environ)
        self.assertEqual(resp.environ['swift.account/a']['status'], 200)
@@ -546,7 +550,7 @@ class TestFuncs(unittest.TestCase):
            resp,
            headers_to_object_info(headers.items(), 200))

-    def test_have_quorum(self):
+    def test_base_have_quorum(self):
        base = Controller(self.app)
        # just throw a bunch of test cases at it
        self.assertEqual(base.have_quorum([201, 404], 3), False)
@@ -648,3 +652,88 @@ class TestFuncs(unittest.TestCase):
            self.assertEqual(v, dst_headers[k.lower()])
        for k, v in bad_hdrs.iteritems():
            self.assertFalse(k.lower() in dst_headers)
+
+    def test_client_chunk_size(self):
+
+        class TestSource(object):
+            def __init__(self, chunks):
+                self.chunks = list(chunks)
+
+            def read(self, _read_size):
+                if self.chunks:
+                    return self.chunks.pop(0)
+                else:
+                    return ''
+
+        source = TestSource((
+            'abcd', '1234', 'abc', 'd1', '234abcd1234abcd1', '2'))
+        req = Request.blank('/v1/a/c/o')
+        node = {}
+        handler = GetOrHeadHandler(self.app, req, None, None, None, None, {},
+                                   client_chunk_size=8)
+
+        app_iter = handler._make_app_iter(req, node, source)
+        client_chunks = list(app_iter)
+        self.assertEqual(client_chunks, [
+            'abcd1234', 'abcd1234', 'abcd1234', 'abcd12'])
+
+    def test_client_chunk_size_resuming(self):
+
+        class TestSource(object):
+            def __init__(self, chunks):
+                self.chunks = list(chunks)
+
+            def read(self, _read_size):
+                if self.chunks:
+                    chunk = self.chunks.pop(0)
+                    if chunk is None:
+                        raise exceptions.ChunkReadTimeout()
+                    else:
+                        return chunk
+                else:
+                    return ''
+
+        node = {'ip': '1.2.3.4', 'port': 6000, 'device': 'sda'}
+
+        source1 = TestSource(['abcd', '1234', 'abc', None])
+        source2 = TestSource(['efgh5678'])
+        req = Request.blank('/v1/a/c/o')
+        handler = GetOrHeadHandler(
+            self.app, req, 'Object', None, None, None, {},
+            client_chunk_size=8)
+
+        app_iter = handler._make_app_iter(req, node, source1)
+        with patch.object(handler, '_get_source_and_node',
+                          lambda: (source2, node)):
+            client_chunks = list(app_iter)
+        self.assertEqual(client_chunks, ['abcd1234', 'efgh5678'])
+        self.assertEqual(handler.backend_headers['Range'], 'bytes=8-')
+
+    def test_bytes_to_skip(self):
+        # if you start at the beginning, skip nothing
+        self.assertEqual(bytes_to_skip(1024, 0), 0)
+
+        # missed the first 10 bytes, so we've got 1014 bytes of partial
+        # record
+        self.assertEqual(bytes_to_skip(1024, 10), 1014)
+
+        # skipped some whole records first
+        self.assertEqual(bytes_to_skip(1024, 4106), 1014)
+
+        # landed on a record boundary
+        self.assertEqual(bytes_to_skip(1024, 1024), 0)
+        self.assertEqual(bytes_to_skip(1024, 2048), 0)
+
+        # big numbers
+        self.assertEqual(bytes_to_skip(2 ** 20, 2 ** 32), 0)
+        self.assertEqual(bytes_to_skip(2 ** 20, 2 ** 32 + 1), 2 ** 20 - 1)
+        self.assertEqual(bytes_to_skip(2 ** 20, 2 ** 32 + 2 ** 19), 2 ** 19)
+
+        # odd numbers
+        self.assertEqual(bytes_to_skip(123, 0), 0)
+        self.assertEqual(bytes_to_skip(123, 23), 100)
+        self.assertEqual(bytes_to_skip(123, 247), 122)
+
+        # prime numbers
+        self.assertEqual(bytes_to_skip(11, 7), 4)
+        self.assertEqual(bytes_to_skip(97, 7873823), 55)
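Note on test_bytes_to_skip above: the assertions pin down the arithmetic. Given a record size and a byte offset where reading starts, the result is the number of bytes left in the partially read record, or zero when the offset falls on a record boundary. The one-line formula below agrees with every assertion in the test. It is a sketch derived from those cases, not necessarily how swift.proxy.controllers.base implements it, and `bytes_to_skip_sketch` is a hypothetical name.

```python
def bytes_to_skip_sketch(record_size, range_start):
    """Bytes to discard so that reading resumes on a record boundary.

    range_start % record_size is how far into a record the read begins;
    the outer modulo maps the "exactly on a boundary" case to 0 rather
    than record_size.
    """
    return (record_size - (range_start % record_size)) % record_size
```

For example, with 1024-byte records and a start offset of 4106, the read begins 10 bytes into the fifth record, so the remaining 1014 bytes of that record are skipped.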
--- a/test/unit/proxy/controllers/test_obj.py
+++ b/test/unit/proxy/controllers/test_obj.py
--- a/test/unit/proxy/test_server.py
+++ b/test/unit/proxy/test_server.py