Update EC overview doc for PUT path

Update the EC overview docs 'under the hood' section to reflect the
change in durable file parity from 2 to ec_nparity + 1 [1].

Also fix some typos and cleanup the text.

[1] change id I80d666f61273e589d0990baa78fd657b3470785d

Change-Id: I23f6299da59ba8357da2bb5976d879d9a4bb173e
This commit is contained in:
Alistair Coles 2015-09-30 09:45:57 +01:00
parent fbbe942041
commit 167f3c8cbd

View File

@ -407,18 +407,28 @@ is required for a few different reasons.
MIME supports a conversation between the proxy and the storage nodes for every
PUT. This provides us with the ability to handle a PUT in one connection and
assure that we have the essence of a 2 phase commit, basically having the proxy
communicate back to the storage nodes once it has confirmation that all fragment
archives in the set have been committed. Note that we still require a quorum of
data elements of the conversation to complete before signaling status to the
client but we can relax that requirement for the commit phase such that only 2
confirmations to that phase of the conversation are required for success as the
reconstructor will assure propagation of markers that indicate data durability.
communicate back to the storage nodes once it has confirmation that a quorum of
fragment archives in the set have been written.
This provides the storage node with a cheap indicator of the last known durable
set of fragment archives for a given object on a successful durable PUT, this is
known as the ``.durable`` file. The presence of a ``.durable`` file means, to
the object server, `there is a set of ts.data files that are durable at
timestamp ts.` Note that the completion of the commit phase of the conversation
For the first phase of the conversation the proxy requires a quorum of
`ec_ndata + 1` fragment archives to be successfully put to storage nodes.
This ensures that the object could still be reconstructed even if one of the
fragment archives becomes unavailable. During the second phase of the
conversation the proxy communicates a confirmation to storage nodes that the
fragment archive quorum has been achieved. This causes the storage node to
create a `ts.durable` file at timestamp `ts` which acts as an indicator of
the last known durable set of fragment archives for a given object. The
presence of a `ts.durable` file means, to the object server, `there is a set
of ts.data files that are durable at timestamp ts`.
For the second phase of the conversation the proxy requires a quorum of
`ec_nparity + 1` successful commits on storage nodes. This ensures that for
as long there are sufficient (i.e. `ec_ndata`) fragment archives to reconstruct
the object then there will be a `.durable` file on at least one storage node.
The reconstructor ensures that `.durable` files are replicated on storage nodes
where they may be missing.
Note that the completion of the commit phase of the conversation
is also a signal for the object server to go ahead and immediately delete older
timestamp files for this object. This is critical as we do not want to delete
the older object until the storage node has confirmation from the proxy, via the
@ -435,12 +445,9 @@ The basic flow looks like this:
* Upon receipt of commit message, object servers store a 0-byte data file as
`<timestamp>.durable` indicating successful PUT, and send a final response to
the proxy server.
* The proxy waits for a minimal number of two object servers to respond with a
* The proxy waits for `ec_nparity + 1` object servers to respond with a
success (2xx) status before responding to the client with a successful
status. In this particular case it was decided that two responses was
the minimum amount to know that the file would be propagated in case of
failure from other others and because a greater number would potentially
mean more latency, which should be avoided if possible.
status.
Here is a high level example of what the conversation looks like::
@ -495,7 +502,7 @@ to the nodes.
The more interesting case is what happens if the proxy dies in the middle of a
conversation. If it turns out that a quorum had been met and the commit phase
of the conversation finished, its as simple as the previous case in that the
reconstructor will repair things. However, if the commit didn't get a change to
reconstructor will repair things. However, if the commit didn't get a chance to
happen then some number of the storage nodes have .data files on them (fragment
archives) but none of them knows whether there are enough elsewhere for the
entire object to be reconstructed. In this case the client will not have
@ -535,12 +542,12 @@ which includes things like the entire object etag.
DiskFile
========
Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter`` and
``ECDiskFileManager`` to impement EC specific handling of on disk files. This
includes things like file name manipulation to include the fragment index in the
filename, determination of valid .data files based on .durable presence,
construction of EC specific hashes.pkl file to include fragment index
information, etc., etc.
Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter``,
``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC specific
handling of on disk files. This includes things like file name manipulation to
include the fragment index in the filename, determination of valid .data files
based on .durable presence, construction of EC specific hashes.pkl file to
include fragment index information, etc., etc.
Metadata
--------