diff --git a/doc/source/overview_erasure_code.rst b/doc/source/overview_erasure_code.rst index a8a8e87891..7b495cd887 100755 --- a/doc/source/overview_erasure_code.rst +++ b/doc/source/overview_erasure_code.rst @@ -407,18 +407,28 @@ is required for a few different reasons. MIME supports a conversation between the proxy and the storage nodes for every PUT. This provides us with the ability to handle a PUT in one connection and assure that we have the essence of a 2 phase commit, basically having the proxy -communicate back to the storage nodes once it has confirmation that all fragment -archives in the set have been committed. Note that we still require a quorum of -data elements of the conversation to complete before signaling status to the -client but we can relax that requirement for the commit phase such that only 2 -confirmations to that phase of the conversation are required for success as the -reconstructor will assure propagation of markers that indicate data durability. +communicate back to the storage nodes once it has confirmation that a quorum of +fragment archives in the set have been written. -This provides the storage node with a cheap indicator of the last known durable -set of fragment archives for a given object on a successful durable PUT, this is -known as the ``.durable`` file. The presence of a ``.durable`` file means, to -the object server, `there is a set of ts.data files that are durable at -timestamp ts.` Note that the completion of the commit phase of the conversation +For the first phase of the conversation the proxy requires a quorum of +`ec_ndata + 1` fragment archives to be successfully put to storage nodes. +This ensures that the object could still be reconstructed even if one of the +fragment archives becomes unavailable. During the second phase of the +conversation the proxy communicates a confirmation to storage nodes that the +fragment archive quorum has been achieved. This causes the storage node to +create a `ts.durable` file at timestamp `ts` which acts as an indicator of +the last known durable set of fragment archives for a given object. The +presence of a `ts.durable` file means, to the object server, `there is a set +of ts.data files that are durable at timestamp ts`. + +For the second phase of the conversation the proxy requires a quorum of +`ec_nparity + 1` successful commits on storage nodes. This ensures that for +as long there are sufficient (i.e. `ec_ndata`) fragment archives to reconstruct +the object then there will be a `.durable` file on at least one storage node. +The reconstructor ensures that `.durable` files are replicated on storage nodes +where they may be missing. + +Note that the completion of the commit phase of the conversation is also a signal for the object server to go ahead and immediately delete older timestamp files for this object. This is critical as we do not want to delete the older object until the storage node has confirmation from the proxy, via the @@ -435,12 +445,9 @@ The basic flow looks like this: * Upon receipt of commit message, object servers store a 0-byte data file as `.durable` indicating successful PUT, and send a final response to the proxy server. - * The proxy waits for a minimal number of two object servers to respond with a + * The proxy waits for `ec_nparity + 1` object servers to respond with a success (2xx) status before responding to the client with a successful - status. In this particular case it was decided that two responses was - the minimum amount to know that the file would be propagated in case of - failure from other others and because a greater number would potentially - mean more latency, which should be avoided if possible. + status. Here is a high level example of what the conversation looks like:: @@ -495,7 +502,7 @@ to the nodes. The more interesting case is what happens if the proxy dies in the middle of a conversation. If it turns out that a quorum had been met and the commit phase of the conversation finished, its as simple as the previous case in that the -reconstructor will repair things. However, if the commit didn't get a change to +reconstructor will repair things. However, if the commit didn't get a chance to happen then some number of the storage nodes have .data files on them (fragment archives) but none of them knows whether there are enough elsewhere for the entire object to be reconstructed. In this case the client will not have @@ -535,12 +542,12 @@ which includes things like the entire object etag. DiskFile ======== -Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter`` and -``ECDiskFileManager`` to impement EC specific handling of on disk files. This -includes things like file name manipulation to include the fragment index in the -filename, determination of valid .data files based on .durable presence, -construction of EC specific hashes.pkl file to include fragment index -information, etc., etc. +Erasure code uses subclassed ``ECDiskFile``, ``ECDiskFileWriter``, +``ECDiskFileReader`` and ``ECDiskFileManager`` to implement EC specific +handling of on disk files. This includes things like file name manipulation to +include the fragment index in the filename, determination of valid .data files +based on .durable presence, construction of EC specific hashes.pkl file to +include fragment index information, etc., etc. Metadata --------