01f9d15045
Update the doc to reflect the change [1] to ndata + 1 .durable files being committed before a success response is returned for a PUT. [1] Ifd36790faa0a5d00ec79c23d1f96a332a0ca0f0b Change-Id: I1744d457bda8a52eb2451029c4031962e92c2bb7
692 lines
32 KiB
ReStructuredText
Executable File
692 lines
32 KiB
ReStructuredText
Executable File
====================
|
|
Erasure Code Support
|
|
====================
|
|
|
|
|
|
--------------------------
|
|
Beta: Not production ready
|
|
--------------------------
|
|
The erasure code support in Swift is considered "beta" at this point.
|
|
Most major functionality is included, but it has not been tested or validated
|
|
at large scale. This feature relies on ssync for durability. Deployers are
|
|
urged to do extensive testing and not deploy production data using an
|
|
erasure code storage policy.
|
|
|
|
If any bugs are found during testing, please report them to
|
|
https://bugs.launchpad.net/swift
|
|
|
|
|
|
-------------------------------
|
|
History and Theory of Operation
|
|
-------------------------------
|
|
|
|
There's a lot of good material out there on Erasure Code (EC) theory, this short
|
|
introduction is just meant to provide some basic context to help the reader
|
|
better understand the implementation in Swift.
|
|
|
|
Erasure Coding for storage applications grew out of Coding Theory as far back as
|
|
the 1960s with the Reed-Solomon codes. These codes have been used for years in
|
|
applications ranging from CDs to DVDs to general communications and, yes, even
|
|
in the space program starting with Voyager! The basic idea is that some amount
|
|
of data is broken up into smaller pieces called fragments and coded in such a
|
|
way that it can be transmitted with the ability to tolerate the loss of some
|
|
number of the coded fragments. That's where the word "erasure" comes in, if you
|
|
transmit 14 fragments and only 13 are received then one of them is said to be
|
|
"erased". The word "erasure" provides an important distinction with EC; it
|
|
isn't about detecting errors, it's about dealing with failures. Another
|
|
important element of EC is that the number of erasures that can be tolerated can
|
|
be adjusted to meet the needs of the application.
|
|
|
|
At a high level EC works by using a specific scheme to break up a single data
|
|
buffer into several smaller data buffers then, depending on the scheme,
|
|
performing some encoding operation on that data in order to generate additional
|
|
information. So you end up with more data than you started with and that extra
|
|
data is often called "parity". Note that there are many, many different
|
|
encoding techniques that vary both in how they organize and manipulate the data
|
|
as well by what means they use to calculate parity. For example, one scheme
|
|
might rely on `Galois Field Arithmetic <http://www.ssrc.ucsc.edu/Papers/plank-
|
|
fast13.pdf>`_ while others may work with only XOR. The number of variations and
|
|
details about their differences are well beyond the scope of this introduction,
|
|
but we will talk more about a few of them when we get into the implementation of
|
|
EC in Swift.
|
|
|
|
--------------------------------
|
|
Overview of EC Support in Swift
|
|
--------------------------------
|
|
|
|
First and foremost, from an application perspective EC support is totally
|
|
transparent. There are no EC related external API; a container is simply created
|
|
using a Storage Policy defined to use EC and then interaction with the cluster
|
|
is the same as any other durability policy.
|
|
|
|
EC is implemented in Swift as a Storage Policy, see :doc:`overview_policies` for
|
|
complete details on Storage Policies. Because support is implemented as a
|
|
Storage Policy, all of the storage devices associated with your cluster's EC
|
|
capability can be isolated. It is entirely possible to share devices between
|
|
storage policies, but for EC it may make more sense to not only use separate
|
|
devices but possibly even entire nodes dedicated for EC.
|
|
|
|
Which direction one chooses depends on why the EC policy is being deployed. If,
|
|
for example, there is a production replication policy in place already and the
|
|
goal is to add a cold storage tier such that the existing nodes performing
|
|
replication are impacted as little as possible, adding a new set of nodes
|
|
dedicated to EC might make the most sense but also incurs the most cost. On the
|
|
other hand, if EC is being added as a capability to provide additional
|
|
durability for a specific set of applications and the existing infrastructure is
|
|
well suited for EC (sufficient number of nodes, zones for the EC scheme that is
|
|
chosen) then leveraging the existing infrastructure such that the EC ring shares
|
|
nodes with the replication ring makes the most sense. These are some of the
|
|
main considerations:
|
|
|
|
* Layout of existing infrastructure.
|
|
* Cost of adding dedicated EC nodes (or just dedicated EC devices).
|
|
* Intended usage model(s).
|
|
|
|
The Swift code base does not include any of the algorithms necessary to perform
|
|
the actual encoding and decoding of data; that is left to external libraries.
|
|
The Storage Policies architecture is leveraged to enable EC on a per container
|
|
basis -- the object rings are still used to determine the placement of EC data
|
|
fragments. Although there are several code paths that are unique to an operation
|
|
associated with an EC policy, an external dependency to an Erasure Code library
|
|
is what Swift counts on to perform the low level EC functions. The use of an
|
|
external library allows for maximum flexibility as there are a significant
|
|
number of options out there, each with its owns pros and cons that can vary
|
|
greatly from one use case to another.
|
|
|
|
---------------------------------------
|
|
PyECLib: External Erasure Code Library
|
|
---------------------------------------
|
|
|
|
PyECLib is a Python Erasure Coding Library originally designed and written as
|
|
part of the effort to add EC support to the Swift project, however it is an
|
|
independent project. The library provides a well-defined and simple Python
|
|
interface and internally implements a plug-in architecture allowing it to take
|
|
advantage of many well-known C libraries such as:
|
|
|
|
* Jerasure and GFComplete at http://jerasure.org.
|
|
* Intel(R) ISA-L at http://01.org/intel%C2%AE-storage-acceleration-library-open-source-version.
|
|
* Or write your own!
|
|
|
|
PyECLib uses a C based library called liberasurecode to implement the plug in
|
|
infrastructure; liberasure code is available at:
|
|
|
|
* liberasurecode: https://bitbucket.org/tsg-/liberasurecode
|
|
|
|
PyECLib itself therefore allows for not only choice but further extensibility as
|
|
well. PyECLib also comes with a handy utility to help determine the best
|
|
algorithm to use based on the equipment that will be used (processors and server
|
|
configurations may vary in performance per algorithm). More on this will be
|
|
covered in the configuration section. PyECLib is included as a Swift
|
|
requirement.
|
|
|
|
For complete details see `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_
|
|
|
|
------------------------------
|
|
Storing and Retrieving Objects
|
|
------------------------------
|
|
|
|
We will discuss the details of how PUT and GET work in the "Under the Hood"
|
|
section later on. The key point here is that all of the erasure code work goes
|
|
on behind the scenes; this summary is a high level information overview only.
|
|
|
|
The PUT flow looks like this:
|
|
|
|
#. The proxy server streams in an object and buffers up "a segment" of data
|
|
(size is configurable).
|
|
#. The proxy server calls on PyECLib to encode the data into smaller fragments.
|
|
#. The proxy streams the encoded fragments out to the storage nodes based on
|
|
ring locations.
|
|
#. Repeat until the client is done sending data.
|
|
#. The client is notified of completion when a quorum is met.
|
|
|
|
The GET flow looks like this:
|
|
|
|
#. The proxy server makes simultaneous requests to participating nodes.
|
|
#. As soon as the proxy has the fragments it needs, it calls on PyECLib to
|
|
decode the data.
|
|
#. The proxy streams the decoded data it has back to the client.
|
|
#. Repeat until the proxy is done sending data back to the client.
|
|
|
|
It may sound like, from this high level overview, that using EC is going to
|
|
cause an explosion in the number of actual files stored in each node's local
|
|
file system. Although it is true that more files will be stored (because an
|
|
object is broken into pieces), the implementation works to minimize this where
|
|
possible, more details are available in the Under the Hood section.
|
|
|
|
-------------
|
|
Handoff Nodes
|
|
-------------
|
|
|
|
In EC policies, similarly to replication, handoff nodes are a set of storage
|
|
nodes used to augment the list of primary nodes responsible for storing an
|
|
erasure coded object. These handoff nodes are used in the event that one or more
|
|
of the primaries are unavailable. Handoff nodes are still selected with an
|
|
attempt to achieve maximum separation of the data being placed.
|
|
|
|
--------------
|
|
Reconstruction
|
|
--------------
|
|
|
|
For an EC policy, reconstruction is analogous to the process of replication for
|
|
a replication type policy -- essentially "the reconstructor" replaces "the
|
|
replicator" for EC policy types. The basic framework of reconstruction is very
|
|
similar to that of replication with a few notable exceptions:
|
|
|
|
* Because EC does not actually replicate partitions, it needs to operate at a
|
|
finer granularity than what is provided with rsync, therefore EC leverages
|
|
much of ssync behind the scenes (you do not need to manually configure ssync).
|
|
* Once a pair of nodes has determined the need to replace a missing object
|
|
fragment, instead of pushing over a copy like replication would do, the
|
|
reconstructor has to read in enough surviving fragments from other nodes and
|
|
perform a local reconstruction before it has the correct data to push to the
|
|
other node.
|
|
* A reconstructor does not talk to all other reconstructors in the set of nodes
|
|
responsible for an EC partition, this would be far too chatty, instead each
|
|
reconstructor is responsible for sync'ing with the partition's closest two
|
|
neighbors (closest meaning left and right on the ring).
|
|
|
|
.. note::
|
|
|
|
EC work (encode and decode) takes place both on the proxy nodes, for PUT/GET
|
|
operations, as well as on the storage nodes for reconstruction. As with
|
|
replication, reconstruction can be the result of rebalancing, bit-rot, drive
|
|
failure or reverting data from a hand-off node back to its primary.
|
|
|
|
--------------------------
|
|
Performance Considerations
|
|
--------------------------
|
|
|
|
Efforts are underway to characterize performance of various Erasure Code
|
|
schemes. One of the main goals of the beta release is to perform this
|
|
characterization and encourage others to do so and provide meaningful feedback
|
|
to the development community. There are many factors that will affect
|
|
performance of EC so it is vital that we have multiple characterization
|
|
activities happening.
|
|
|
|
In general, EC has different performance characteristics than replicated data.
|
|
EC requires substantially more CPU to read and write data, and is more suited
|
|
for larger objects that are not frequently accessed (eg backups).
|
|
|
|
----------------------------
|
|
Using an Erasure Code Policy
|
|
----------------------------
|
|
|
|
To use an EC policy, the administrator simply needs to define an EC policy in
|
|
`swift.conf` and create/configure the associated object ring. An example of how
|
|
an EC policy can be setup is shown below::
|
|
|
|
[storage-policy:2]
|
|
name = ec104
|
|
policy_type = erasure_coding
|
|
ec_type = jerasure_rs_vand
|
|
ec_num_data_fragments = 10
|
|
ec_num_parity_fragments = 4
|
|
ec_object_segment_size = 1048576
|
|
|
|
Let's take a closer look at each configuration parameter:
|
|
|
|
* ``name``: This is a standard storage policy parameter.
|
|
See :doc:`overview_policies` for details.
|
|
* ``policy_type``: Set this to ``erasure_coding`` to indicate that this is an EC
|
|
policy.
|
|
* ``ec_type``: Set this value according to the available options in the selected
|
|
PyECLib back-end. This specifies the EC scheme that is to be used. For
|
|
example the option shown here selects Vandermonde Reed-Solomon encoding while
|
|
an option of ``flat_xor_hd_3`` would select Flat-XOR based HD combination
|
|
codes. See the `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_ page for
|
|
full details.
|
|
* ``ec_num_data_fragments``: The total number of fragments that will be
|
|
comprised of data.
|
|
* ``ec_num_parity_fragments``: The total number of fragments that will be
|
|
comprised of parity.
|
|
* ``ec_object_segment_size``: The amount of data that will be buffered up before
|
|
feeding a segment into the encoder/decoder. The default value is 1048576.
|
|
|
|
When PyECLib encodes an object, it will break it into N fragments. However, what
|
|
is important during configuration, is how many of those are data and how many
|
|
are parity. So in the example above, PyECLib will actually break an object in
|
|
14 different fragments, 10 of them will be made up of actual object data and 4
|
|
of them will be made of parity data (calculations depending on ec_type).
|
|
|
|
When deciding which devices to use in the EC policy's object ring, be sure to
|
|
carefully consider the performance impacts. Running some performance
|
|
benchmarking in a test environment for your configuration is highly recommended
|
|
before deployment.
|
|
|
|
To create the EC policy's object ring, the only difference in the usage of the
|
|
``swift-ring-builder create`` command is the ``replicas`` parameter. The
|
|
``replicas`` value is the number of fragments spread across the object servers
|
|
associated with the ring; ``replicas`` must be equal to the sum of
|
|
``ec_num_data_fragments`` and ``ec_num_parity_fragments``. For example::
|
|
|
|
swift-ring-builder object-1.builder create 10 14 1
|
|
|
|
Note that in this example the ``replicas`` value of 14 is based on the sum of
|
|
10 EC data fragments and 4 EC parity fragments.
|
|
|
|
Once you have configured your EC policy in `swift.conf` and created your object
|
|
ring, your application is ready to start using EC simply by creating a container
|
|
with the specified policy name and interacting as usual.
|
|
|
|
.. note::
|
|
|
|
It's important to note that once you have deployed a policy and have created
|
|
objects with that policy, these configurations options cannot be changed. In
|
|
case a change in the configuration is desired, you must create a new policy
|
|
and migrate the data to a new container.
|
|
|
|
Migrating Between Policies
|
|
--------------------------
|
|
|
|
A common usage of EC is to migrate less commonly accessed data from a more
|
|
expensive but lower latency policy such as replication. When an application
|
|
determines that it wants to move data from a replication policy to an EC policy,
|
|
it simply needs to move the data from the replicated container to an EC
|
|
container that was created with the target durability policy.
|
|
|
|
Region Support
|
|
--------------
|
|
|
|
For at least the initial version of EC, it is not recommended that an EC scheme
|
|
span beyond a single region, neither performance nor functional validation has
|
|
be been done in such a configuration.
|
|
|
|
--------------
|
|
Under the Hood
|
|
--------------
|
|
|
|
Now that we've explained a little about EC support in Swift and how to
|
|
configure/use it, let's explore how EC fits in at the nuts-n-bolts level.
|
|
|
|
Terminology
|
|
-----------
|
|
|
|
The term 'fragment' has been used already to describe the output of the EC
|
|
process (a series of fragments) however we need to define some other key terms
|
|
here before going any deeper. Without paying special attention to using the
|
|
correct terms consistently, it is very easy to get confused in a hurry!
|
|
|
|
* **chunk**: HTTP chunks received over wire (term not used to describe any EC
|
|
specific operation).
|
|
* **segment**: Not to be confused with SLO/DLO use of the word, in EC we call a
|
|
segment a series of consecutive HTTP chunks buffered up before performing an
|
|
EC operation.
|
|
* **fragment**: Data and parity 'fragments' are generated when erasure coding
|
|
transformation is applied to a segment.
|
|
* **EC archive**: A concatenation of EC fragments; to a storage node this looks
|
|
like an object.
|
|
* **ec_ndata**: Number of EC data fragments.
|
|
* **ec_nparity**: Number of EC parity fragments.
|
|
|
|
Middleware
|
|
----------
|
|
|
|
Middleware remains unchanged. For most middleware (e.g., SLO/DLO) the fact that
|
|
the proxy is fragmenting incoming objects is transparent. For list endpoints,
|
|
however, it is a bit different. A caller of list endpoints will get back the
|
|
locations of all of the fragments. The caller will be unable to re-assemble the
|
|
original object with this information, however the node locations may still
|
|
prove to be useful information for some applications.
|
|
|
|
On Disk Storage
|
|
---------------
|
|
|
|
EC archives are stored on disk in their respective objects-N directory based on
|
|
their policy index. See :doc:`overview_policies` for details on per policy
|
|
directory information.
|
|
|
|
The actual names on disk of EC archives also have one additional piece of data
|
|
encoded in the filename, the fragment archive index.
|
|
|
|
Each storage policy now must include a transformation function that diskfile
|
|
will use to build the filename to store on disk. The functions are implemented
|
|
in the diskfile module as policy specific sub classes ``DiskFileManager``.
|
|
|
|
This is required for a few reasons. For one, it allows us to store fragment
|
|
archives of different indexes on the same storage node which is not typical
|
|
however it is possible in many circumstances. Without unique filenames for the
|
|
different EC archive files in a set, we would be at risk of overwriting one
|
|
archive of index n with another of index m in some scenarios.
|
|
|
|
The transformation function for the replication policy is simply a NOP. For
|
|
reconstruction, the index is appended to the filename just before the .data
|
|
extension. An example filename for a fragment archive storing the 5th fragment
|
|
would like this this::
|
|
|
|
1418673556.92690#5.data
|
|
|
|
An additional file is also included for Erasure Code policies called the
|
|
``.durable`` file. Its meaning will be covered in detail later, however, its on-
|
|
disk format does not require the name transformation function that was just
|
|
covered. The .durable for the example above would simply look like this::
|
|
|
|
1418673556.92690.durable
|
|
|
|
And it would be found alongside every fragment specific .data file following a
|
|
100% successful PUT operation.
|
|
|
|
Proxy Server
|
|
------------
|
|
|
|
High Level
|
|
==========
|
|
|
|
The Proxy Server handles Erasure Coding in a different manner than replication,
|
|
therefore there are several code paths unique to EC policies either though sub
|
|
classing or simple conditionals. Taking a closer look at the PUT and the GET
|
|
paths will help make this clearer. But first, a high level overview of how an
|
|
object flows through the system:
|
|
|
|
.. image:: images/ec_overview.png
|
|
|
|
Note how:
|
|
|
|
* Incoming objects are buffered into segments at the proxy.
|
|
* Segments are erasure coded into fragments at the proxy.
|
|
* The proxy stripes fragments across participating nodes such that the on-disk
|
|
stored files that we call a fragment archive is appended with each new
|
|
fragment.
|
|
|
|
This scheme makes it possible to minimize the number of on-disk files given our
|
|
segmenting and fragmenting.
|
|
|
|
Multi_Phase Conversation
|
|
========================
|
|
|
|
Multi-part MIME document support is used to allow the proxy to engage in a
|
|
handshake conversation with the storage node for processing PUT requests. This
|
|
is required for a few different reasons.
|
|
|
|
#. From the perspective of the storage node, a fragment archive is really just
|
|
another object, we need a mechanism to send down the original object etag
|
|
after all fragment archives have landed.
|
|
#. Without introducing strong consistency semantics, the proxy needs a mechanism
|
|
to know when a quorum of fragment archives have actually made it to disk
|
|
before it can inform the client of a successful PUT.
|
|
|
|
MIME supports a conversation between the proxy and the storage nodes for every
|
|
PUT. This provides us with the ability to handle a PUT in one connection and
|
|
assure that we have the essence of a 2 phase commit, basically having the proxy
|
|
communicate back to the storage nodes once it has confirmation that a quorum of
|
|
fragment archives in the set have been written.
|
|
|
|
For the first phase of the conversation the proxy requires a quorum of
|
|
`ec_ndata + 1` fragment archives to be successfully put to storage nodes.
|
|
This ensures that the object could still be reconstructed even if one of the
|
|
fragment archives becomes unavailable. During the second phase of the
|
|
conversation the proxy communicates a confirmation to storage nodes that the
|
|
fragment archive quorum has been achieved. This causes the storage node to
|
|
create a `ts.durable` file at timestamp `ts` which acts as an indicator of
|
|
the last known durable set of fragment archives for a given object. The
|
|
presence of a `ts.durable` file means, to the object server, `there is a set
|
|
of ts.data files that are durable at timestamp ts`.
|
|
|
|
For the second phase of the conversation the proxy requires a quorum of
|
|
`ec_ndata + 1` successful commits on storage nodes. This ensures that there are
|
|
sufficient committed fragment archives for the object to be reconstructed even
|
|
if one becomes unavailable. The reconstructor ensures that `.durable` files are
|
|
replicated on storage nodes where they may be missing.
|
|
|
|
Note that the completion of the commit phase of the conversation
|
|
is also a signal for the object server to go ahead and immediately delete older
|
|
timestamp files for this object. This is critical as we do not want to delete
|
|
the older object until the storage node has confirmation from the proxy, via the
|
|
multi-phase conversation, that the other nodes have landed enough for a quorum.
|
|
|
|
The basic flow looks like this:
|
|
|
|
* The Proxy Server erasure codes and streams the object fragments
|
|
(ec_ndata + ec_nparity) to the storage nodes.
|
|
* The storage nodes store objects as EC archives and upon finishing object
|
|
data/metadata write, send a 1st-phase response to proxy.
|
|
* Upon quorum of storage nodes responses, the proxy initiates 2nd-phase by
|
|
sending commit confirmations to object servers.
|
|
* Upon receipt of commit message, object servers store a 0-byte data file as
|
|
`<timestamp>.durable` indicating successful PUT, and send a final response to
|
|
the proxy server.
|
|
* The proxy waits for `ec_ndata + 1` object servers to respond with a
|
|
success (2xx) status before responding to the client with a successful
|
|
status.
|
|
|
|
Here is a high level example of what the conversation looks like::
|
|
|
|
proxy: PUT /p/a/c/o
|
|
Transfer-Encoding': 'chunked'
|
|
Expect': '100-continue'
|
|
X-Backend-Obj-Multiphase-Commit: yes
|
|
obj: 100 Continue
|
|
X-Obj-Multiphase-Commit: yes
|
|
proxy: --MIMEboundary
|
|
X-Document: object body
|
|
<obj_data>
|
|
--MIMEboundary
|
|
X-Document: object metadata
|
|
Content-MD5: <footer_meta_cksum>
|
|
<footer_meta>
|
|
--MIMEboundary
|
|
<object server writes data, metadata>
|
|
obj: 100 Continue
|
|
<quorum>
|
|
proxy: X-Document: put commit
|
|
commit_confirmation
|
|
--MIMEboundary--
|
|
<object server writes ts.durable state>
|
|
obj: 20x
|
|
<proxy waits to receive >=2 2xx responses>
|
|
proxy: 2xx -> client
|
|
|
|
A few key points on the .durable file:
|
|
|
|
* The .durable file means \"the matching .data file for this has sufficient
|
|
fragment archives somewhere, committed, to reconstruct the object\".
|
|
* The Proxy Server will never have knowledge, either on GET or HEAD, of the
|
|
existence of a .data file on an object server if it does not have a matching
|
|
.durable file.
|
|
* The object server will never return a .data that does not have a matching
|
|
.durable.
|
|
* When a proxy does a GET, it will only receive fragment archives that have
|
|
enough present somewhere to be reconstructed.
|
|
|
|
Partial PUT Failures
|
|
====================
|
|
|
|
A partial PUT failure has a few different modes. In one scenario the Proxy
|
|
Server is alive through the entire PUT conversation. This is a very
|
|
straightforward case. The client will receive a good response if and only if a
|
|
quorum of fragment archives were successfully landed on their storage nodes. In
|
|
this case the Reconstructor will discover the missing fragment archives, perform
|
|
a reconstruction and deliver fragment archives and their matching .durable files
|
|
to the nodes.
|
|
|
|
The more interesting case is what happens if the proxy dies in the middle of a
|
|
conversation. If it turns out that a quorum had been met and the commit phase
|
|
of the conversation finished, its as simple as the previous case in that the
|
|
reconstructor will repair things. However, if the commit didn't get a chance to
|
|
happen then some number of the storage nodes have .data files on them (fragment
|
|
archives) but none of them knows whether there are enough elsewhere for the
|
|
entire object to be reconstructed. In this case the client will not have
|
|
received a 2xx response so there is no issue there, however, it is left to the
|
|
storage nodes to clean up the stale fragment archives. Work is ongoing in this
|
|
area to enable the proxy to play a role in reviving these fragment archives,
|
|
however, for the current release, a proxy failure after the start of a
|
|
conversation but before the commit message will simply result in a PUT failure.
|
|
|
|
GET
|
|
===
|
|
|
|
The GET for EC is different enough from replication that subclassing the
|
|
`BaseObjectController` to the `ECObjectController` enables an efficient way to
|
|
implement the high level steps described earlier:
|
|
|
|
#. The proxy server makes simultaneous requests to participating nodes.
|
|
#. As soon as the proxy has the fragments it needs, it calls on PyECLib to
|
|
decode the data.
|
|
#. The proxy streams the decoded data it has back to the client.
|
|
#. Repeat until the proxy is done sending data back to the client.
|
|
|
|
The GET path will attempt to contact all nodes participating in the EC scheme,
|
|
if not enough primaries respond then handoffs will be contacted just as with
|
|
replication. Etag and content length headers are updated for the client
|
|
response following reconstruction as the individual fragment archives metadata
|
|
is valid only for that fragment archive.
|
|
|
|
Object Server
|
|
-------------
|
|
|
|
The Object Server, like the Proxy Server, supports MIME conversations as
|
|
described in the proxy section earlier. This includes processing of the commit
|
|
message and decoding various sections of the MIME document to extract the footer
|
|
which includes things like the entire object etag.
|
|
|
|
DiskFile
|
|
========
|
|
|
|
Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter``,
|
|
``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC specific
|
|
handling of on disk files. This includes things like file name manipulation to
|
|
include the fragment index in the filename, determination of valid .data files
|
|
based on .durable presence, construction of EC specific hashes.pkl file to
|
|
include fragment index information, etc., etc.
|
|
|
|
Metadata
|
|
--------
|
|
|
|
There are few different categories of metadata that are associated with EC:
|
|
|
|
System Metadata: EC has a set of object level system metadata that it
|
|
attaches to each of the EC archives. The metadata is for internal use only:
|
|
|
|
* ``X-Object-Sysmeta-EC-Etag``: The Etag of the original object.
|
|
* ``X-Object-Sysmeta-EC-Content-Length``: The content length of the original
|
|
object.
|
|
* ``X-Object-Sysmeta-EC-Frag-Index``: The fragment index for the object.
|
|
* ``X-Object-Sysmeta-EC-Scheme``: Description of the EC policy used to encode
|
|
the object.
|
|
* ``X-Object-Sysmeta-EC-Segment-Size``: The segment size used for the object.
|
|
|
|
User Metadata: User metadata is unaffected by EC, however, a full copy of the
|
|
user metadata is stored with every EC archive. This is required as the
|
|
reconstructor needs this information and each reconstructor only communicates
|
|
with its closest neighbors on the ring.
|
|
|
|
PyECLib Metadata: PyECLib stores a small amount of metadata on a per fragment
|
|
basis. This metadata is not documented here as it is opaque to Swift.
|
|
|
|
Database Updates
|
|
----------------
|
|
|
|
As account and container rings are not associated with a Storage Policy, there
|
|
is no change to how these database updates occur when using an EC policy.
|
|
|
|
The Reconstructor
|
|
-----------------
|
|
|
|
The Reconstructor performs analogous functions to the replicator:
|
|
|
|
#. Recovery from disk drive failure.
|
|
#. Moving data around because of a rebalance.
|
|
#. Reverting data back to a primary from a handoff.
|
|
#. Recovering fragment archives from bit rot discovered by the auditor.
|
|
|
|
However, under the hood it operates quite differently. The following are some
|
|
of the key elements in understanding how the reconstructor operates.
|
|
|
|
Unlike the replicator, the work that the reconstructor does is not always as
|
|
easy to break down into the 2 basic tasks of synchronize or revert (move data
|
|
from handoff back to primary) because of the fact that one storage node can
|
|
house fragment archives of various indexes and each index really /"belongs/" to
|
|
a different node. So, whereas when the replicator is reverting data from a
|
|
handoff it has just one node to send its data to, the reconstructor can have
|
|
several. Additionally, its not always the case that the processing of a
|
|
particular suffix directory means one or the other for the entire directory (as
|
|
it does for replication). The scenarios that create these mixed situations can
|
|
be pretty complex so we will just focus on what the reconstructor does here and
|
|
not a detailed explanation of why.
|
|
|
|
Job Construction and Processing
|
|
===============================
|
|
|
|
Because of the nature of the work it has to do as described above, the
|
|
reconstructor builds jobs for a single job processor. The job itself contains
|
|
all of the information needed for the processor to execute the job which may be
|
|
a synchronization or a data reversion and there may be a mix of jobs that
|
|
perform both of these operations on the same suffix directory.
|
|
|
|
Jobs are constructed on a per partition basis and then per fragment index basis.
|
|
That is, there will be one job for every fragment index in a partition.
|
|
Performing this construction \"up front\" like this helps minimize the
|
|
interaction between nodes collecting hashes.pkl information.
|
|
|
|
Once a set of jobs for a partition has been constructed, those jobs are sent off
|
|
to threads for execution. The single job processor then performs the necessary
|
|
actions working closely with ssync to carry out its instructions. For data
|
|
reversion, the actual objects themselves are cleaned up via the ssync module and
|
|
once that partition's set of jobs is complete, the reconstructor will attempt to
|
|
remove the relevant directory structures.
|
|
|
|
The scenarios that job construction has to take into account include:
|
|
|
|
#. A partition directory with all fragment indexes matching the local node
|
|
index. This is the case where everything is where it belongs and we just
|
|
need to compare hashes and sync if needed, here we sync with our partners.
|
|
#. A partition directory with one local fragment index and mix of others. Here
|
|
we need to sync with our partners where fragment indexes matches the
|
|
local_id, all others are sync'd with their home nodes and then deleted.
|
|
#. A partition directory with no local fragment index and just one or more of
|
|
others. Here we sync with just the home nodes for the fragment indexes that
|
|
we have and then all the local archives are deleted. This is the basic
|
|
handoff reversion case.
|
|
|
|
.. note::
|
|
A \"home node\" is the node where the fragment index encoded in the
|
|
fragment archive's filename matches the node index of a node in the primary
|
|
partition list.
|
|
|
|
Node Communication
|
|
==================
|
|
|
|
The replicators talk to all nodes who have a copy of their object, typically
|
|
just 2 other nodes. For EC, having each reconstructor node talk to all nodes
|
|
would incur a large amount of overhead as there will typically be a much larger
|
|
number of nodes participating in the EC scheme. Therefore, the reconstructor is
|
|
built to talk to its adjacent nodes on the ring only. These nodes are typically
|
|
referred to as partners.
|
|
|
|
Reconstruction
|
|
==============
|
|
|
|
Reconstruction can be thought of sort of like replication but with an extra step
|
|
in the middle. The reconstructor is hard-wired to use ssync to determine what is
|
|
missing and desired by the other side. However, before an object is sent over
|
|
the wire it needs to be reconstructed from the remaining fragments as the local
|
|
fragment is just that - a different fragment index than what the other end is
|
|
asking for.
|
|
|
|
Thus, there are hooks in ssync for EC based policies. One case would be for
|
|
basic reconstruction which, at a high level, looks like this:
|
|
|
|
* Determine which nodes need to be contacted to collect other EC archives needed
|
|
to perform reconstruction.
|
|
* Update the etag and fragment index metadata elements of the newly constructed
|
|
fragment archive.
|
|
* Establish a connection to the target nodes and give ssync a DiskFileLike class
|
|
that it can stream data from.
|
|
|
|
The reader in this class gathers fragments from the nodes and uses PyECLib to
|
|
reconstruct each segment before yielding data back to ssync. Essentially what
|
|
this means is that data is buffered, in memory, on a per segment basis at the
|
|
node performing reconstruction and each segment is dynamically reconstructed and
|
|
delivered to `ssync_sender` where the `send_put()` method will ship them on
|
|
over. The sender is then responsible for deleting the objects as they are sent
|
|
in the case of data reversion.
|
|
|
|
The Auditor
|
|
-----------
|
|
|
|
Because the auditor already operates on a per storage policy basis, there are no
|
|
specific auditor changes associated with EC. Each EC archive looks like, and is
|
|
treated like, a regular object from the perspective of the auditor. Therefore,
|
|
if the auditor finds bit-rot in an EC archive, it simply quarantines it and the
|
|
reconstructor will take care of the rest just as the replicator does for
|
|
replication policies.
|