From be5cbb9bd2b9d2a87185c7de24d8d4ceaf6b2b80 Mon Sep 17 00:00:00 2001
From: John Griffith
Date: Fri, 29 Jan 2016 09:39:04 -0500
Subject: [PATCH] Spec for Cheesecake approach to replication

Scale back a bit again; nail this one before M, since nothing has released
on the old version, and then move on to Tiramisu.

Change-Id: I9fc6870ea657906e9ff35b3134e6b61bf69a4193
---
 specs/mitaka/cheesecake.rst | 435 ++++++++++++++++++++++++++++++++++++
 1 file changed, 435 insertions(+)
 create mode 100644 specs/mitaka/cheesecake.rst

diff --git a/specs/mitaka/cheesecake.rst b/specs/mitaka/cheesecake.rst
new file mode 100644
index 00000000..d8ceca34
--- /dev/null
+++ b/specs/mitaka/cheesecake.rst
@@ -0,0 +1,435 @@

..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License.

  http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Cheesecake
==========================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/cinder/+spec/replication

This spec proposes a further refinement of Cinder replication. After more
vendors tried to implement replication, and we learned more lessons about
the differences between backends and their semantics, we decided to step
back and look at simplifying the design even further.

The goal of the new design is to address a large amount of confusion and
the differing interpretations. Rather than try to cover multiple use cases
in the first iteration, this spec aims to address a single, fairly
well-defined use case. We can then iterate and move on from there.

Problem description
===================

The existing design is great for some backends, but is challenging for many
devices to fit into. It's also filled with pitfalls around the question of
managed versus unmanaged targets, not to mention the difficulty of failing
over some volumes while leaving others behind. The concept of failing over
on a per-volume basis instead of a per-device basis, while nice for
testing, doesn't fit the intended use case, results in quite a bit of
complexity, and isn't something a number of backends can even support.

Use Cases
=========

This is intended to be a DR mechanism. The model use case is a catastrophic
event occurring on the backend storage device; some or all of the volumes
that were on the primary backend may have been replicated to another
backend device, in which case those volumes may still be accessible.

The flow of events is as follows:

1. Admin configures a backend device to enable replication. We have a
   configured Cinder backend just as always (Backend-A), but we add config
   options for a replication target (Backend-B).

   a. We no longer differentiate between managed and unmanaged targets.
      To enable a replication target (or targets), the replication_target
      entry is the ONLY method allowed, and it is specified as a section
      in the driver's configuration.

   b. Depending on the backend device, enabling this may mean that EVERY
      volume created on the device is replicated; or, for devices that
      have the capability, and if the Admin chooses to do so, a
      Volume-Type of "replicated=True" can be created and used by tenants.

      Note that if the backend only supports replicating "all" volumes, or
      if the Admin wants to set things up so that "all" volumes are
      replicated, the Type creation may or may not be necessary.

2. Tenant creates a Volume that is replicated (either by specifying the
   appropriate Type, or by the nature of the backend device). The result
   in this example is a Volume we'll call "Foo".

3. Backend-A is caught in the crossfire of a water balloon fight that
   shouldn't have been taking place in the data center, and loses its
   magic smoke. "It's dead, Jim!"

4. Admin issues the "cinder replication-failover" command with possible
   arguments.

   a. The call propagates to the Cinder driver, which performs the
      appropriate steps for that driver to now point at the secondary
      (target) device (Backend-B).

   b. The Service table in Cinder's database is updated to indicate that a
      replication failover event has occurred, and that the driver is
      currently pointing at an alternate target device.

      In this scenario, volumes that were replicated should still be
      accessible by tenants. Their usage may or may not be restricted,
      depending on options provided in the failover command. If no
      restrictions are set, we expect to be able to continue using them as
      we would prior to the failure event.

      Volumes that were attached/in-use are a special case in this
      scenario and will require additional steps. The Tenant will be
      required in this case to detach the volumes from any instances
      manually. Cinder does not have the ability to call Nova's
      volume-detach methods, so this has to be done by the Tenant or the
      Admin.

   c. Freeze option provided as an argument to failover.
      The failover command includes a "freeze" option. This option
      indicates that a volume may still be read from or written to;
      HOWEVER, we will not allow any additional resource create or delete
      operations until an admin issues a "thaw" command. This means that
      attempts to call snapshot-create, xxx-delete, resize, retype, etc.
      should return an InvalidCommand error. This is intended to keep
      things in as stable a state as possible, to help in recovering from
      the catastrophic event. We think of this as the backend resources
      becoming read-only from a management/control plane perspective. This
      does not mean you can't do R/W IO from an instance to the volume.

5. How to get back to "normal".

   a. If the original backend device is salvageable, the failover command
      should be used to switch back to the original primary device. This
      of course means there must be some mechanism on the backend, and
      operations performed by the Admin, to ensure the resources still
      exist on the Primary (Backend-A) and that their data is updated
      based on what may have been written while they were hosted on
      Backend-B. This implies that, for backends to support fail-back,
      something like two-way replication is going to be required. For
      backends that can't support this, it's likely that we'll need to
      instead swap the primary and secondary configuration info
      (reconfigure, making Backend-B the Primary).

It's important to emphasize: if a volume is not of type "replicated", it
will NOT be accessible after the failover. This approach fails over the
entire backend to another device.
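To make the flow above concrete, here's a minimal sketch of the commands it
implies. The type name and key are illustrative, the backend names come
from the example above, and the replication-* commands are the ones
proposed by this spec, not commands that exist in python-cinderclient
today::

    # Admin: create a Volume-Type that lands volumes on the replicated
    # backend (only needed if the backend doesn't replicate everything)
    cinder type-create replicated
    cinder type-key replicated set replicated=True

    # Tenant: create the replicated volume "Foo"
    cinder create --volume-type replicated --display-name Foo 10

    # Backend-A dies; Admin fails the whole backend over to Backend-B,
    # freezing management operations while recovery is under way
    cinder replication-failover --freeze Backend-A

    # Later, once things are stable again
    cinder replication-thaw Backend-A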
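As a rough illustration of what this implies on the driver side, here's a
hedged Python sketch; every name in it is hypothetical, and the real driver
interface will be settled during implementation. The point is simply that
the driver routes everything, including stats, through whichever device is
currently active::

    class ReplicationDevice(object):
        """Hypothetical stand-in for a connection to one backend device."""

        def __init__(self, access_meta):
            # access_meta corresponds to the 'remote_device' info above.
            self.access_meta = access_meta

        def get_stats(self, refresh=False):
            # A real driver would query the device here.
            return {'replication': True, 'device': self.access_meta}


    class FooDriver(object):
        """Hypothetical driver sketch, not an actual Cinder interface."""

        def __init__(self, primary_meta, replication_metas):
            # The primary plus the configured replication target(s).
            self._primary = ReplicationDevice(primary_meta)
            self._targets = [ReplicationDevice(m)
                             for m in replication_metas]
            # Every call routes through whichever device is active.
            self._active = self._primary

        def failover(self):
            # Point the driver at the secondary; from here on, every call,
            # including get_volume_stats(), hits the replication target.
            self._active = self._targets[0]

        def get_volume_stats(self, refresh=False):
            # After a failover this reports the secondary's capabilities,
            # NOT the now-dead primary's.
            return self._active.get_stats(refresh)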
* Add the following API calls:

  replication-enable/disable 'backend-name'
    This issues a command to the backend to update the capabilities being
    reported for replication.

  replication-failover [--freeze] 'backend-name'
    This triggers the failover event, assuming that the current primary
    backend is no longer accessible.

  replication-thaw 'backend-name'
    This thaws a backend that experienced a failover and is frozen.

Special considerations
----------------------

* Async vs. sync
  This spec makes no assumptions about which replication method the
  backend uses, nor does it care.

* Transport
  The implementation details of *how* the backend performs replication
  are completely up to the backend. The requirements are that the
  interfaces and end results are consistent.

* The volume driver for the replicated backend MUST have the ability to
  communicate with the other backend and route calls correctly based on
  what's selected as the current primary. One example of an important
  detail here is the "update stats" call.

  In the case of a failover, it is expected that the secondary/target
  device is now reporting stats/capabilities, NOT the now *dead* backend.

* Tenant visibility
  The visibility by tenants is LIMITED!!! In other words, the tenant
  should know very little about what's going on. The only information that
  should really be propagated is that the backend and the volume are in a
  "failed-over" state, and whether they're "frozen".

In the case of a failover where volumes are no longer available on the new
backend, the driver should raise a NotFound exception for any API calls
that attempt to access them.


Alternatives
------------

There are all sorts of alternatives, the most obvious of which is to keep
the implementation we have and iron it out. Maybe that's good, maybe it's
not. In my opinion this approach is simpler, easier to maintain and more
flexible; otherwise I wouldn't propose it. Given that only one vendor has
implemented replication in the existing setup, and they currently have a
number of open issues, we're not causing a terrible amount of churn or
disturbance if we move forward with this now.

The result should be easier to implement and, as an option, will have less
impact on the core code.

One appealing option would be to leave Cinder more cloud-like and not even
offer replication.

Data model impact
-----------------

We'll need a new column in the host table that indicates "failed-over" and
"frozen" status.

We'll also need a new property for volumes, indicating whether they're
failed over and whether they're frozen.

Finally, to plan for cases where a backend has multiple replication
targets, we need to provide drivers a mechanism to persist some ID info
about where the failover was directed. In other words, make sure the
driver has a way to set things back up correctly on an init.
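As a sketch only, in the style of Cinder's sqlalchemy-migrate based
migrations, the change might look something like the following; the column
names and types here are illustrative, not final::

    from sqlalchemy import Boolean, Column, MetaData, String, Table


    def upgrade(migrate_engine):
        meta = MetaData()
        meta.bind = migrate_engine

        # Per-service (i.e. per-backend) state: has it failed over, is it
        # frozen, and which replication target is currently active, so the
        # driver can set itself back up correctly on init.
        services = Table('services', meta, autoload=True)
        services.create_column(Column('replication_status', String(36)))
        services.create_column(Column('frozen', Boolean, default=False))
        services.create_column(Column('active_backend_id', String(255)))

        # Per-volume state, so a volume can report that it's failed over
        # (assuming no equivalent column already exists).
        volumes = Table('volumes', meta, autoload=True)
        volumes.create_column(Column('replication_status', String(36)))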
REST API impact
---------------

replication-enable/disable 'backend-name'
  This issues a command to the backend to update the capabilities being
  reported for replication.

replication-failover [--freeze] 'backend-name'
  This triggers the failover event, assuming that the current primary
  backend is no longer accessible.

replication-thaw 'backend-name'
  This thaws a backend that experienced a failover and was frozen.
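To illustrate the freeze semantics described earlier, here's a hedged
sketch of how the API layer might reject management operations on a frozen
backend. The helper and function names are hypothetical; only the
InvalidCommand error name comes from this spec::

    class InvalidCommand(Exception):
        """Stand-in for the error frozen backends should return."""


    def check_not_frozen(service):
        # 'frozen' refers to the new Service state proposed in the data
        # model section; tenants can still do R/W IO to their volumes.
        if service.get('frozen'):
            raise InvalidCommand(
                "Backend %s is frozen pending recovery; create/delete/"
                "resize/retype are disabled until replication-thaw is "
                "issued." % service['host'])


    def create_snapshot(context, volume, service):
        # Management-plane operations check the frozen flag first.
        check_not_frozen(service)
        # ... normal snapshot creation would continue here ...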
Security impact
---------------

Describe any potential security impact on the system. Some of the items to
consider include:

* Does this change touch sensitive data such as tokens, keys, or user data?

  Nope

* Does this change alter the API in a way that may impact security, such as
  a new way to access sensitive information or a new way to log in?

  Nope, not that I know of

* Does this change involve cryptography or hashing?

  Nope, not that I know of

* Does this change require the use of sudo or any elevated privileges?

  Nope, not that I know of

* Does this change involve using or parsing user-provided data? This could
  be directly at the API level or indirectly such as changes to a cache
  layer.

  Nope, not that I know of

* Can this change enable a resource exhaustion attack, such as allowing a
  single API interaction to consume significant server resources? Some
  examples of this include launching subprocesses for each connection, or
  entity expansion attacks in XML.

  Nope, not that I know of

For more detailed guidance, please see the OpenStack Security Guidelines as
a reference (https://wiki.openstack.org/wiki/Security/Guidelines). These
guidelines are a work in progress and are designed to help you identify
security best practices. For further information, feel free to reach out
to the OpenStack Security Group at openstack-security@lists.openstack.org.

Notifications impact
--------------------

We'd certainly want to add a notification event indicating that we "failed
over". We'd also want freeze/thaw and enable/disable events.

Other end user impact
---------------------

Aside from the API, are there other ways a user will interact with this
feature?

* Does this change have an impact on python-cinderclient? What does the
  user interface there look like?

TBD

Performance Impact
------------------

Describe any potential performance impact on the system, for example how
often new code will be called, and whether there is a major change to the
calling pattern of existing code.

Examples of things to consider here include:

* A periodic task might look like a small addition, but when considering
  large-scale deployments the proposed call may in fact be performed on
  hundreds of nodes.

* Scheduler filters get called once per host for every volume being
  created, so any latency they introduce is linear with the size of the
  system.

* A small change in a utility function or a commonly used decorator can
  have a large impact on performance.

* Calls which result in database queries can have a profound impact on
  performance, especially in critical sections of code.

* Will the change include any locking, and if so, what considerations are
  there on holding the lock?

Other deployer impact
---------------------

Discuss things that will affect how you deploy and configure OpenStack
that have not already been mentioned, such as:

* What config options are being added? Should they be more generic than
  proposed (for example a flag that other volume drivers might want to
  implement as well)? Are the default values ones which will work well in
  real deployments?

* Is this a change that takes immediate effect after it's merged, or is it
  something that has to be explicitly enabled?

* If this change is a new binary, how would it be deployed?

* Please state anything that those doing continuous deployment, or those
  upgrading from the previous release, need to be aware of. Also describe
  any plans to deprecate configuration values or features. For example, if
  we change the directory name that targets (LVM) are stored in, how do we
  handle any used directories created before the change landed? Do we move
  them? Do we have a special case in the code? Do we assume that the
  operator will recreate all the volumes in their cloud?

Developer impact
----------------

Discuss things that will affect other developers working on OpenStack,
such as:

* If the blueprint proposes a change to the driver API, discussion of how
  other volume drivers would implement the feature is required.


Implementation
==============

Assignee(s)
-----------

Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate
the primary author and contact.

Primary assignee:
  john-griffith

Other contributors:


Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.


Dependencies
============

* Include specific references to specs and/or blueprints in cinder, or in
  other projects, that this one either depends on or is related to.

* If this requires functionality of another project that is not currently
  used by Cinder (such as the glance v2 API when we previously only
  required v1), document that fact.

* Does this feature require any new library dependencies or code otherwise
  not included in OpenStack? Or does it depend on a specific version of a
  library?

* Needs Horizon support.

Testing
=======

Please discuss how the change will be tested. We especially want to know
what tempest tests will be added. It is assumed that unit test coverage
will be added, so that doesn't need to be mentioned explicitly, but
discussion of why you think unit tests are sufficient and we don't need to
add more tempest tests would need to be included.

Is this untestable in the gate given current limitations (specific
hardware/software configurations available)? If so, are there mitigation
plans (3rd party testing, gate enhancements, etc.)?


Documentation Impact
====================

What is the impact on the docs team of this change? Some changes might
require donating resources to the docs team to have the documentation
updated. Don't repeat details discussed above, but please reference them
here.

Obviously this is going to need docs and devref info in the Cinder docs
tree.


References
==========

Please add any useful references here. You are not required to have any
reference.
Moreover, this specification should still make sense when your references
are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions

* Links to notes from a summit session

* Links to relevant research, if appropriate

* Related specifications as appropriate (e.g. a link to any vendor
  documentation)

* Anything else you feel it is worthwhile to refer to

  The specs process is a bit much; we should revisit it. It's rather
  bloated, and while the first few sections are fantastic for requiring
  thought and planning, towards the end it just gets silly.