Merge "Global EC Under Development Documentation"
This commit is contained in:
commit
b7e0494be2
@ -2,9 +2,9 @@
|
||||
Erasure Code Support
====================

*******************************
History and Theory of Operation
*******************************

There's a lot of good material out there on Erasure Code (EC) theory; this
short introduction is just meant to provide some basic context to help the
reader better understand the implementation in Swift.

The details of the differences between the various EC techniques are well
beyond the scope of this introduction, but we will talk more about a few of
them when we get into the implementation of EC in Swift.

Overview of EC Support in Swift
================================

First and foremost, from an application perspective EC support is totally
transparent. There is no EC-related external API; a container is simply
created using a storage policy defined to use EC, and then interacted with as
any other container would be.

Using an external library allows for maximum flexibility, as there are a
significant number of options out there, each with its own pros and cons that
can vary greatly from one use case to another.

PyECLib: External Erasure Code Library
=======================================

PyECLib is a Python Erasure Coding Library originally designed and written as
part of the effort to add EC support to the Swift project; however, it is an
independent project in its own right.

For complete details see `PyECLib <https://github.com/openstack/pyeclib>`_.
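
As a brief, hedged sketch of the interface PyECLib provides (the object body
and the fragment slice below are arbitrary), encoding and decoding look like
this::

    from pyeclib.ec_iface import ECDriver

    # 10 data + 4 parity fragments, matching the example policy below.
    driver = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')

    data = b'object body ' * 4096
    fragments = driver.encode(data)   # list of k + m = 14 fragments

    # Any k = 10 distinct fragments are sufficient to recover the object.
    assert driver.decode(fragments[4:]) == data
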

Storing and Retrieving Objects
==============================

We will discuss the details of how PUT and GET work in the "Under the Hood"
section later on. The key point here is that all of the erasure code work goes
on behind the scenes.

Although it is true that more files will be stored (because an object is
broken into pieces), the implementation works to minimize this where possible;
more details are available in the Under the Hood section.

Handoff Nodes
=============

In EC policies, similarly to replication, handoff nodes are a set of storage
nodes used to augment the list of primary nodes responsible for storing an
erasure coded object. These handoff nodes are used in the event that one or
more of the primaries are unavailable. Handoff nodes are still selected with
an attempt to achieve maximum separation of the data being placed.

Reconstruction
==============

For an EC policy, reconstruction is analogous to the process of replication
for a replication type policy -- essentially "the reconstructor" replaces "the
replicator". The basic framework of reconstruction is similar to that of
replication, with a few notable exceptions:

* Like replication, reconstruction can be the result of rebalancing, bit-rot,
  drive failure or reverting data from a hand-off node back to its primary.

**************************
Performance Considerations
**************************

In general, EC has different performance characteristics than replicated data.
EC requires substantially more CPU to read and write data, and is more suited
for larger objects that are not frequently accessed (e.g. backups).

Operators are encouraged to characterize the performance of various EC schemes
and share their observations with the developer community.

****************************
Using an Erasure Code Policy
****************************

To use an EC policy, the administrator simply needs to define an EC policy in
`swift.conf` and create/configure the associated object ring. An example of
how an EC policy can be set up is shown below::

    [storage-policy:1]
    name = deepfreeze10-4
    policy_type = erasure_coding
    ec_type = liberasurecode_rs_vand
    ec_num_data_fragments = 10
    ec_num_parity_fragments = 4
    ec_object_segment_size = 1048576

Let's take a closer look at each configuration parameter:

* ``name``: This is a standard storage policy parameter.
* ``ec_num_parity_fragments``: The total number of fragments that will be
  comprised of parity.
* ``ec_object_segment_size``: The amount of data that will be buffered up
  before feeding a segment into the encoder/decoder. The default value is
  1048576.

When PyECLib encodes an object, it will break it into N fragments. However,
what is important during configuration is how many of those are data and how
many are parity.

When configuring the object ring, the number of ``replicas`` associated with
the ring must be equal to the sum of ``ec_num_data_fragments`` and
``ec_num_parity_fragments``::

    swift-ring-builder object-1.builder create 10 14 1

Note that in this example the ``replicas`` value of ``14`` is based on the sum
of ``10`` EC data fragments and ``4`` EC parity fragments.
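
As a hedged sketch of the remaining ring setup (the regions, zones, IP
addresses, device names, and weights here are all hypothetical), devices are
then added and the ring rebalanced as with any other policy::

    swift-ring-builder object-1.builder add r1z1-192.168.1.1:6200/sdb1 100
    swift-ring-builder object-1.builder add r1z2-192.168.1.2:6200/sdb1 100
    swift-ring-builder object-1.builder rebalance
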

Once you have configured your EC policy in `swift.conf` and created your
object ring, your application is ready to start using EC simply by creating a
container with the specified policy name and interacting as usual. If the EC
configuration needs to change later, a new policy must be created and the data
migrated to a new container.
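
A hedged sketch of that container creation (the token, endpoint, and container
name here are hypothetical; the policy name comes from the example above) --
it is just a normal container PUT carrying the standard ``X-Storage-Policy``
header::

    curl -i -X PUT -H "X-Auth-Token: $TOKEN" \
         -H "X-Storage-Policy: deepfreeze10-4" \
         https://swift.example.com/v1/AUTH_test/ec_container

Objects PUT into and GET from the container are then erasure coded without any
further client involvement.
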

Migrating Between Policies
==========================

A common usage of EC is to migrate less commonly accessed data from a more
expensive but lower latency policy such as replication. When an application
determines that it wants to move data from a replication policy to an EC
policy, it simply needs to move the data from the replicated container to an
EC container that was created with the target durability policy.

*********
Global EC
*********

Since the initial release of EC, it has not been recommended that an EC scheme
span beyond a single region. Initial performance and functional validation has
shown that using sufficiently large parity schemas to ensure availability
across regions is inefficient, and that rebalance is unoptimized across
high-latency, bandwidth-constrained WANs.

Region support for EC policies is under development! `EC Duplication` provides
a foundation for this.

EC Duplication
==============

.. warning::

    EC Duplication is an experimental feature that has some serious known
    issues which make it currently unsuitable for use in production.

EC Duplication enables Swift to make duplicated copies of fragments of erasure
coded objects. If an EC storage policy is configured with a non-default
``ec_duplication_factor`` of ``N > 1``, then the policy will create ``N``
duplicates of each unique fragment that is returned from the configured EC
engine.

Duplication of EC fragments is optimal for EC storage policies which require
dispersion of fragment data across failure domains. Without duplication,
common EC parameters will not distribute enough unique fragments between large
failure domains to allow for a rebuild using fragments from any one domain.
For example, a uniformly distributed ``10+4`` EC policy schema would place 7
fragments in each of two failure domains, which is less in each failure domain
than the 10 fragments needed to rebuild a missing fragment.
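
A minimal sketch of that arithmetic, using the ``10+4`` schema and two failure
domains from the example above::

    num_data, num_parity, num_domains = 10, 4, 2

    total_fragments = num_data + num_parity        # 14 unique fragments
    per_domain = total_fragments // num_domains    # 7 fragments per domain

    # A domain can rebuild on its own only if it holds >= num_data fragments.
    assert per_domain < num_data   # 7 < 10: neither domain can rebuild alone
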
Without duplication support, an EC policy schema must be adjusted to include
additional parity fragments in order to guarantee the number of fragments in
each failure domain is greater than the number required to rebuild. For
example, a uniformly distributed ``10+18`` EC policy schema would place 14
fragments in each of two failure domains, which is more than sufficient in
each failure domain to rebuild a missing fragment. However, empirical testing
has shown encoding a schema with ``num_parity > num_data`` (such as ``10+18``)
is less efficient than using duplication of fragments. EC fragment duplication
enables Swift's Global EC to maintain more independence between failure
domains without sacrificing efficiency on read/write or rebuild!

.. note::

    The ``ec_duplication_factor`` option may be configured in `swift.conf` in
    each ``storage-policy`` section. The option may be omitted - the default
    value is ``1`` (i.e. no duplication)::

        [storage-policy:2]
        name = ec104
        policy_type = erasure_coding
        ec_type = liberasurecode_rs_vand
        ec_num_data_fragments = 10
        ec_num_parity_fragments = 4
        ec_object_segment_size = 1048576
        ec_duplication_factor = 2

.. warning::

    The ``ec_duplication_factor`` option should only be set for experimental
    and development purposes. EC Duplication is an experimental feature that
    has some serious known issues which make it currently unsuitable for use
    in production.

In this example, a ``10+4`` schema and a duplication factor of ``2`` will
result in ``(10+4)x2 = 28`` fragments being stored (we will use the shorthand
``10+4x2`` to denote that policy configuration). The ring for this policy
should be configured with 28 replicas (i.e. ``(ec_num_data_fragments +
ec_num_parity_fragments) * ec_duplication_factor``). A ``10+4x2`` schema
**can** allow a multi-region deployment to rebuild an object to full
durability even when *more* than 14 fragments are unavailable. This is
advantageous with respect to a ``10+18`` configuration not only because reads
from data fragments will be more common and more efficient, but also because a
``10+4x2`` can grow into a ``10+4x3`` to expand into another region.
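
As a hedged sketch (the part power and ``min_part_hours`` are simply carried
over from the earlier ``create 10 14 1`` example), the corresponding ring
creation for this ``[storage-policy:2]`` would look like::

    swift-ring-builder object-2.builder create 10 28 1
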
Known Issues
============

Unique Fragment Dispersion
--------------------------

Currently, Swift's ring placement does **not** guarantee the dispersion of
fragments' locations being robust to disaster recovery in the case of Global
EC. While the goal is to have one duplicate of each fragment placed in each
region, it is currently possible for duplicates of the same fragment to be
placed in the same region (and consequently for another region to have no
duplicates of that fragment). Since a set of ``ec_num_data_fragments`` unique
fragments is required to reconstruct an object, a suboptimal distribution of
duplicates across regions may, in some cases, make it impossible to assemble
such a set from a single region.

For example, if we have a Swift cluster with two regions, ``r1`` and ``r2``,
the 12 fragments for an object in a ``4+2x2`` EC policy schema could have
pathologically sub-optimal placement::

    r1
     <timestamp>#0#d.data
     <timestamp>#0#d.data
     <timestamp>#2#d.data
     <timestamp>#2#d.data
     <timestamp>#4#d.data
     <timestamp>#4#d.data
    r2
     <timestamp>#1#d.data
     <timestamp>#1#d.data
     <timestamp>#3#d.data
     <timestamp>#3#d.data
     <timestamp>#5#d.data
     <timestamp>#5#d.data

In this case, ``r1`` has only the fragments with index ``0, 2, 4`` and ``r2``
has the other 3 indexes, but we need 4 unique indexes to be able to rebuild an
object in a single region. To resolve this issue, a composite ring feature is
being developed which will provide the operator with greater control over
duplicate fragment placement::

    https://review.openstack.org/#/c/271920/

Efficient Node Selection for Read
---------------------------------

Since EC policies require a set of unique fragment indexes to decode the
original object, it is increasingly likely with EC duplication that some
responses from backend storage nodes will include fragments which the proxy
has already received from another node. Currently Swift iterates over the
nodes ordered by a sorting method defined in the proxy server config (i.e.
either shuffle, node_timing, or read_affinity) - but these configurations will
not offer optimal request patterns for EC policies with duplicated fragments.
In this case Swift may frequently issue more than the optimal
``ec_num_data_fragments`` backend requests in order to gather
``ec_num_data_fragments`` **unique** fragments, even if there are no failures
amongst the object-servers.
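
As a hedged illustration of the effect (a toy simulation, not Swift code),
shuffle the primaries of a duplicated policy and count how many requests are
needed before enough unique fragment indexes have been seen::

    import random

    def requests_for_unique(ec_ndata=4, num_unique=6, duplication_factor=2):
        # One node per fragment copy; each node holds one fragment index.
        nodes = [i % num_unique
                 for i in range(num_unique * duplication_factor)]
        random.shuffle(nodes)   # e.g. the 'shuffle' sorting method
        seen = set()
        for requests, frag_index in enumerate(nodes, 1):
            seen.add(frag_index)
            if len(seen) == ec_ndata:
                return requests

    # Averaged over many trials this is noticeably larger than ec_ndata (4).
    trials = [requests_for_unique() for _ in range(10000)]
    print(sum(trials) / len(trials))
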
In addition to better placement and read affinity support, ideally node
iteration for EC duplication policies could predict which nodes are likely to
hold duplicates and prioritize requests to the most suitable nodes.

Efficient Cross Region Rebuild
------------------------------

Since fragments are duplicated between regions it may in some cases be more
attractive to restore failed fragments from their duplicates in another region
instead of rebuilding them from other fragments in the local region.
Conversely, to avoid WAN transfer, it may be more attractive to rebuild
fragments from local parity. During rebalance it will always be more
attractive to revert a fragment from its old primary to its new primary rather
than rebuilding or transferring a duplicate from the remote region.

**************
Under the Hood
**************

Now that we've explained a little about EC support in Swift and how to
configure and use it, let's explore how EC fits in at the nuts-n-bolts level.

Terminology
===========

The term 'fragment' has been used already to describe the output of the EC
process (a series of fragments); however, we need to define some other key
terms as well. Unless we use the correct terms consistently, it is very easy
to get confused in a hurry!

* **ec_nparity**: Number of EC parity fragments.

Middleware
==========

Middleware remains unchanged. For most middleware (e.g., SLO/DLO) the fact
that the proxy is fragmenting incoming objects is transparent. For list
endpoints, a caller will get back the locations of the fragments rather than
of a single object; the caller will be unable to re-assemble the original
object with this information, however the node locations may still prove to be
useful information for some applications.

On Disk Storage
===============

EC archives are stored on disk in their respective objects-N directory based
on their policy index. See :doc:`overview_policies` for details on per-policy
directory information.

The transformation function for the replication policy is simply a NOP.
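
As an illustrative sketch (the device, partition, suffix, hash, and timestamp
are made up), an EC fragment archive's on-disk name encodes the fragment index
(and, in current implementations, the durable state) alongside the usual
timestamp::

    /srv/node/sda1/objects-2/298/06d/<hash>/1475277031.04419#3#d.data
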
Proxy Server
============

High Level
----------

The Proxy Server handles Erasure Coding in a different manner than
replication; therefore, there are several code paths unique to EC policies,
either through sub-classing or simple conditionals.

This scheme makes it possible to minimize the number of on-disk files given
our segmenting and fragmenting.

Multi-Phase Conversation
------------------------

Multi-part MIME document support is used to allow the proxy to engage in a
handshake conversation with the storage node for processing PUT requests.

Partial PUT Failures
--------------------

A partial PUT failure has a few different modes. In one scenario the Proxy
Server is alive through the entire PUT conversation. However, for the current
release, a proxy failure after the start of a conversation but before the
commit message will simply result in a PUT failure.

GET
---

The GET for EC is different enough from replication that subclassing the
`BaseObjectController` to the `ECObjectController` enables an efficient way to
implement it. The proxy ensures that it has sufficient EC archives with the
same timestamp and distinct fragment indexes before considering a GET to be
successful.
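
A hedged sketch of that bookkeeping (a toy model, not the actual
``ECObjectController`` code): group fragment responses by timestamp and report
success once enough distinct fragment indexes have been collected::

    from collections import defaultdict

    def got_enough(responses, ec_ndata):
        # responses: iterable of (timestamp, frag_index) pairs from
        # object servers; duplicates of an index do not help.
        buckets = defaultdict(set)
        for timestamp, frag_index in responses:
            buckets[timestamp].add(frag_index)
            if len(buckets[timestamp]) >= ec_ndata:
                return True
        return False

    assert got_enough([('t1', 0), ('t1', 0), ('t1', 1)], ec_ndata=2)
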
Object Server
=============

The Object Server, like the Proxy Server, supports MIME conversations as
described in the proxy section earlier. This includes processing of the commit
message and decoding various sections of the MIME document to extract the
footer which includes things like the entire object etag.

DiskFile
--------

Erasure code policies use subclassed ``ECDiskFile``, ``ECDiskFileWriter``,
``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC-specific
handling of on-disk files. This includes encoding the fragment index and
durable state in the filename, construction of an EC-specific ``hashes.pkl``
file to include fragment index information, etc.

Metadata
^^^^^^^^

There are a few different categories of metadata that are associated with EC:

PyECLib Metadata: PyECLib stores a small amount of metadata on a per-fragment
basis. This metadata is not documented here as it is opaque to Swift.
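
For illustration only (the header names and values shown here are an
approximation; consult the Swift source for the authoritative set), the EC
system metadata stored with each fragment archive looks something like::

    X-Object-Sysmeta-Ec-Scheme: liberasurecode_rs_vand 10+4
    X-Object-Sysmeta-Ec-Segment-Size: 1048576
    X-Object-Sysmeta-Ec-Frag-Index: 3
    X-Object-Sysmeta-Ec-Etag: <etag of the complete object>
    X-Object-Sysmeta-Ec-Content-Length: <length of the complete object>
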
Database Updates
================

As account and container rings are not associated with a Storage Policy, there
is no change to how these database updates occur when using an EC policy.

The Reconstructor
=================

The Reconstructor performs analogous functions to the replicator.

The situations that require reconstruction can be pretty complex, so we will
just focus on what the reconstructor does here and not a detailed explanation
of why.

Job Construction and Processing
-------------------------------

Because of the nature of the work it has to do as described above, the
reconstructor builds jobs for a single job processor.

Job construction must account for a variety of scenarios.

Node Communication
------------------

The replicators talk to all nodes who have a copy of their object, typically
just 2 other nodes. For EC, having each reconstructor node talk to all nodes
would be impractical, so the reconstructor is built to talk to its adjacent
nodes on the ring only. These nodes are typically referred to as partners.
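
A minimal sketch of the adjacency idea (not Swift's actual implementation;
``part_nodes`` is assumed to be the ordered list of primary nodes for a
partition)::

    def get_partners(node_index, part_nodes):
        """Return the two ring-adjacent primaries of the given node."""
        length = len(part_nodes)
        return [part_nodes[(node_index - 1) % length],
                part_nodes[(node_index + 1) % length]]
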
Reconstruction
--------------

Reconstruction can be thought of sort of like replication but with an extra
step in the middle. The reconstructor is hard-wired to use ssync to determine
what is missing and desired by the other side. In the case of data reversion,
the sender is then responsible for deleting the objects as they are sent over.

The Auditor
===========

Because the auditor already operates on a per storage policy basis, there are
no specific auditor changes associated with EC. Each EC archive looks like,
and is treated as, a regular object from the auditor's perspective.