Update EC overview doc for PUT path
Update the EC overview docs 'under the hood' section to reflect the change in durable file parity from 2 to ec_nparity + 1 [1]. Also fix some typos and cleanup the text. [1] change id I80d666f61273e589d0990baa78fd657b3470785d Change-Id: I23f6299da59ba8357da2bb5976d879d9a4bb173e
This commit is contained in:
parent
fbbe942041
commit
167f3c8cbd
@ -407,18 +407,28 @@ is required for a few different reasons.
|
||||
MIME supports a conversation between the proxy and the storage nodes for every
|
||||
PUT. This provides us with the ability to handle a PUT in one connection and
|
||||
assure that we have the essence of a 2 phase commit, basically having the proxy
|
||||
communicate back to the storage nodes once it has confirmation that all fragment
|
||||
archives in the set have been committed. Note that we still require a quorum of
|
||||
data elements of the conversation to complete before signaling status to the
|
||||
client but we can relax that requirement for the commit phase such that only 2
|
||||
confirmations to that phase of the conversation are required for success as the
|
||||
reconstructor will assure propagation of markers that indicate data durability.
|
||||
communicate back to the storage nodes once it has confirmation that a quorum of
|
||||
fragment archives in the set have been written.
|
||||
|
||||
This provides the storage node with a cheap indicator of the last known durable
|
||||
set of fragment archives for a given object on a successful durable PUT, this is
|
||||
known as the ``.durable`` file. The presence of a ``.durable`` file means, to
|
||||
the object server, `there is a set of ts.data files that are durable at
|
||||
timestamp ts.` Note that the completion of the commit phase of the conversation
|
||||
For the first phase of the conversation the proxy requires a quorum of
|
||||
`ec_ndata + 1` fragment archives to be successfully put to storage nodes.
|
||||
This ensures that the object could still be reconstructed even if one of the
|
||||
fragment archives becomes unavailable. During the second phase of the
|
||||
conversation the proxy communicates a confirmation to storage nodes that the
|
||||
fragment archive quorum has been achieved. This causes the storage node to
|
||||
create a `ts.durable` file at timestamp `ts` which acts as an indicator of
|
||||
the last known durable set of fragment archives for a given object. The
|
||||
presence of a `ts.durable` file means, to the object server, `there is a set
|
||||
of ts.data files that are durable at timestamp ts`.
|
||||
|
||||
For the second phase of the conversation the proxy requires a quorum of
|
||||
`ec_nparity + 1` successful commits on storage nodes. This ensures that for
|
||||
as long there are sufficient (i.e. `ec_ndata`) fragment archives to reconstruct
|
||||
the object then there will be a `.durable` file on at least one storage node.
|
||||
The reconstructor ensures that `.durable` files are replicated on storage nodes
|
||||
where they may be missing.
|
||||
|
||||
Note that the completion of the commit phase of the conversation
|
||||
is also a signal for the object server to go ahead and immediately delete older
|
||||
timestamp files for this object. This is critical as we do not want to delete
|
||||
the older object until the storage node has confirmation from the proxy, via the
|
||||
@ -435,12 +445,9 @@ The basic flow looks like this:
|
||||
* Upon receipt of commit message, object servers store a 0-byte data file as
|
||||
`<timestamp>.durable` indicating successful PUT, and send a final response to
|
||||
the proxy server.
|
||||
* The proxy waits for a minimal number of two object servers to respond with a
|
||||
* The proxy waits for `ec_nparity + 1` object servers to respond with a
|
||||
success (2xx) status before responding to the client with a successful
|
||||
status. In this particular case it was decided that two responses was
|
||||
the minimum amount to know that the file would be propagated in case of
|
||||
failure from other others and because a greater number would potentially
|
||||
mean more latency, which should be avoided if possible.
|
||||
status.
|
||||
|
||||
Here is a high level example of what the conversation looks like::
|
||||
|
||||
@ -495,7 +502,7 @@ to the nodes.
|
||||
The more interesting case is what happens if the proxy dies in the middle of a
|
||||
conversation. If it turns out that a quorum had been met and the commit phase
|
||||
of the conversation finished, its as simple as the previous case in that the
|
||||
reconstructor will repair things. However, if the commit didn't get a change to
|
||||
reconstructor will repair things. However, if the commit didn't get a chance to
|
||||
happen then some number of the storage nodes have .data files on them (fragment
|
||||
archives) but none of them knows whether there are enough elsewhere for the
|
||||
entire object to be reconstructed. In this case the client will not have
|
||||
@ -535,12 +542,12 @@ which includes things like the entire object etag.
|
||||
DiskFile
|
||||
========
|
||||
|
||||
Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter`` and
|
||||
``ECDiskFileManager`` to impement EC specific handling of on disk files. This
|
||||
includes things like file name manipulation to include the fragment index in the
|
||||
filename, determination of valid .data files based on .durable presence,
|
||||
construction of EC specific hashes.pkl file to include fragment index
|
||||
information, etc., etc.
|
||||
Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter``,
|
||||
``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC specific
|
||||
handling of on disk files. This includes things like file name manipulation to
|
||||
include the fragment index in the filename, determination of valid .data files
|
||||
based on .durable presence, construction of EC specific hashes.pkl file to
|
||||
include fragment index information, etc., etc.
|
||||
|
||||
Metadata
|
||||
--------
|
||||
|
Loading…
Reference in New Issue
Block a user