Merge "Update cheesecake promotion specification"
This commit is contained in:
commit
2f5701b912
@ -1,161 +0,0 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Promote Secondary Backend After Failover (Cheesecake)
=====================================================

https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion

Problem description
===================

After failing backend A over to backend B, there is no mechanism in
Cinder to promote backend B to the master backend in order to then replicate
to a backend C.

Current Workflow
----------------

1. Setup Replication
2. Failure Occurs
3. Fail over
4. Promote Secondary Backend

   a. Freeze backend to prevent manage operations
   b. Stop cinder volume service
   c. Update cinder.conf to have backend A replaced with B and B with C
   d. *Hack db to set backend to no longer be in 'failed-over' state*

      * This is the step this spec is concerned with
      * Example:

        ::

            update services set disabled=0,
                disabled_reason=NULL,
                replication_status='enabled',
                active_backend_id=NULL
            where id=3;

   e. Start cinder volume service
   f. Unfreeze backend

Use Cases
=========

There was a fire in my data center and my primary backend (A) was destroyed.
Luckily, I was replicating that backend to backend (B). After failing over
to backend B and repairing the data center, we installed backend C to be a
new replication target for B.

Proposed change
===============

Add the following commands to reset the active backend for a host.

::

    cinder reset-active-backend <backend-name>
    PUT /os-services/reset_active_backend {"host": <backend-name>}

Equivalent to:

::

    update services set disabled=0,
        disabled_reason=NULL,
        replication_status='disabled',
        active_backend_id=NULL
    where host='<backend-name>';

Add the following to re-enable replication from backend B -> C.

::

    cinder replication-enable <backend-name>
    PUT /os-services/replication_enable {"host": <backend-name>}

Equivalent to:

::

    update services set replication_status='enabled'
    where host='<backend-name>';

Alternatives
------------

Instead of 'admin' APIs, we add cinder-manage commands.

::

    cinder-manage reset-active-backend <backend-name>
    cinder-manage replication-enable <backend-name>

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  ?

Work Items
----------

* Implement cinder reset-active-backend API
* Implement cinder replication-enable API
* Document post-fail-over recovery in the Admin guide

Dependencies
============

None

Testing
=======

None

Documentation Impact
====================

Documentation in the Admin guide for how to perform a backend promotion.

References
==========

None

specs/queens/cheesecake-promote-backend.rst (new file, 235 lines)

@@ -0,0 +1,235 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================================
Promote Replication Target (Cheesecake)
========================================

https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion

Problem description
===================

After failing backend A over to backend B, there is no mechanism in
Cinder to promote backend B to the master backend in order to then replicate
to a backend C. We also lack some management commands to help rebuild after a
disaster if states get out of sync.

Current Workflow
----------------

1. Setup Replication
2. Failure Occurs
3. Fail over
4. Promote Secondary Backend

   a. Freeze backend to prevent manage operations
   b. Stop cinder volume service
   c. Update cinder.conf to have backend A replaced with B and B with C
   d. *Hack db to set backend to no longer be in 'failed-over' state*

      * This is the step this spec is concerned with
      * Example:

        ::

            update services set disabled=0,
                disabled_reason=NULL,
                replication_status='enabled',
                active_backend_id=NULL
            where id=3;

   e. Start cinder volume service
   f. Unfreeze backend

Use Cases
=========

There was a fire in my data center and my primary backend (A) was destroyed.
Luckily, I was replicating that backend to backend (B). After failing over
to backend B and repairing the data center, we installed backend C to be a
new replication target for B.

There is also the case where, for whatever reason, the replication state gets
out of sync with reality, and the status and active backend id need to be
adjusted manually by the cloud admin while recovering from a disaster.

Proposed change
===============

To handle the case where the Cinder DB just needs to be synchronized with the
real world, we will add the following cinder-manage command to reset the
active backend for a host. Similar to reset-state for volumes, this will just
do DB operations. The assumption is that cinder.conf has already been updated
and that the volume service is stopped, in a disabled state, and probably
frozen. The manage command will verify that the service is stopped, disabled,
and frozen.

::

    cinder-manage reset-active-backend replication_status=<status> \
        <active_backend_id> <backend-host>

Equivalent to:

::

    update services set replication_status='<status>',
        active_backend_id='<id>'
    where host='<backend-host>';

Where the default for `status` will be disabled and `active_backend_id` will
default to None. The target for this could also be a cluster in A-A
deployments.

Note: It will be up to the Admin to re-enable the service.

That gives us the ability to avoid having an admin manually run DB commands,
but it does *NOT* allow for doing this in an "online" recovery. The volume
service must be offline for this to work safely.
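
As an illustration only, the command's logic might look roughly like the
sketch below; this is not the final implementation, and the service lookup
helper and the `frozen`/`disabled`/`is_up` checks are assumptions drawn from
the workflow above.

::

    from cinder import context
    from cinder import objects

    def reset_active_backend(replication_status, active_backend_id,
                             backend_host):
        # Hypothetical sketch: load the service row, refuse to touch a
        # live service, then apply the same update as the SQL above.
        ctxt = context.get_admin_context()
        service = objects.Service.get_by_args(ctxt, backend_host,
                                              'cinder-volume')
        if service.is_up or not service.disabled or not service.frozen:
            raise SystemExit('Service must be stopped, disabled, and frozen.')
        service.replication_status = replication_status or 'disabled'
        service.active_backend_id = active_backend_id or None
        service.save()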

Making this change will require drivers to make adjustments to their
replication states upon initialization. As-is, we don't really have much
definition around what is allowed to change in cinder.conf and how much the
driver should support for changes to replication targets and the like. There
is a sort of implied contract that when `__init__` or `do_setup` is called
the driver *should* make its current replication state match what it is
given in the config. This is a little problematic today, though, as there
isn't a mechanism for drivers to update DB entries like
`replication_extended_status` or `replication_driver_data`. To close that
gap, and to allow for this new officially supported method of modifying
replication status while offline, we will introduce a new driver API method:

::

    update_host_replication(replication_status, active_backend_id, volumes, groups)

This method will be called immediately following `do_setup` and will return a
model update for the service as well as a list of updates for the volumes and
groups if the driver supports replication groups. The `groups` parameter will
default to None if not defined.

Drivers will need to take appropriate steps to get the system into the
desired state based on the current DB state (which was theoretically modified
before startup by the new cinder-manage command) and the current cinder.conf.
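
As a rough sketch, assuming the hook returns the service update plus lists of
volume and group updates as described above, a driver might implement it
along these lines; the `_sync_replication_targets` helper is hypothetical:

::

    def update_host_replication(self, replication_status, active_backend_id,
                                volumes, groups=None):
        # Hypothetical driver sketch: reconcile the backend's real
        # replication state with the (possibly hand-reset) DB state and
        # the targets currently listed in cinder.conf.
        self._sync_replication_targets(active_backend_id)

        service_update = {'replication_status': replication_status,
                          'active_backend_id': active_backend_id}
        volume_updates = [
            {'volume_id': volume.id,
             'updates': {'replication_status': replication_status}}
            for volume in volumes]
        group_updates = [
            {'group_id': group.id,
             'updates': {'replication_status': replication_status}}
            for group in groups or []]
        return service_update, volume_updates, group_updates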

If not implemented, things will continue to work as they do today, and the
admin may potentially have to take more drastic measures to recover after
performing a failover. The goal here is not to break any existing
functionality, but to add enough infrastructure for drivers to behave better.

When we do this we might require some way of fencing to prevent multiple A-A
driver instances from doing the setup at the same time, as that will more
than likely be problematic for some backends and could risk data loss if more
than one is modifying replication states at the same time. As a simple
solution we could use a new replication_status like `updating` for the
service, and only allow a driver to call `update_host_replication` if the
status is not already set to that. This status can also be beneficial for an
admin to know what is happening if the update via the driver takes a
noticeable amount of time.
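
A minimal sketch of that fencing, assuming the service object supports a
compare-and-swap style `conditional_update` (as Cinder's versioned objects
do) and that a `db.Not` filter is available for negated expectations:

::

    # Hypothetical fencing sketch: only the instance that wins the swap
    # to 'updating' runs the driver hook; losers skip the setup entirely.
    desired_status = service.replication_status
    won = service.conditional_update(
        {'replication_status': 'updating'},
        expected_values={'replication_status': db.Not('updating')})
    if won:
        try:
            driver.update_host_replication(
                desired_status, service.active_backend_id, volumes, groups)
        finally:
            # Restore the real status so other instances are unblocked.
            service.replication_status = desired_status
            service.save()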

Doing it this way should also allow for later making "online" updates, where
we can utilize the same driver hook to modify replication states. This spec
and the initial implementation do not aim to cover that scenario.

Alternatives
------------

We could add admin APIs to Cinder. Those APIs could do the DB updates and
ping the drivers. The downside is that this requires the API and volume
services to be online, which may be problematic in the scenario where you
are picking up the pieces after a disaster.

Later on we can look into doing "online" promotions where the volume service
does not need to be offline. Similar code in the drivers would be required,
but the complexity increases rapidly when trying to support this.

There was also discussion about using new admin APIs which would modify a DB
state that tracks replication info. The downside to this is that we would
move into a scenario where the running state doesn't match the configured
state.

Following on that path of tracking replication state in the DB, we could go
to the extreme and move all of the replication configuration to be done via
APIs. We could then track state and provide drivers with diffs as the state
changes. In the longer term that addresses the runtime vs. config state
disparity, but it would be a significant change in workflow and deployments,
not to mention it would require somewhat major changes to drivers
implementing replication.

Data model impact
-----------------

A new status for the service/cluster will be added.

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Volume service startup will probably take a performance hit, depending on
the backend and how many replicated volumes need to be modified and updated.

Other deployer impact
---------------------

None

Developer impact
----------------

Driver maintainers will potentially need to implement this new functionality,
and be aware of the implications of how and when replication configuration
and status can be adjusted.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  jbernard

Work Items
----------

* Implement the cinder-manage reset-active-backend command
* Implement volume manager changes to allow `update_host_replication` to be
  used at startup by drivers.
* Open a bug against each backend that supports replication and needs an
  update as a result of this change.

Dependencies
============

None

Testing
=======

None

Documentation Impact
====================

Documentation in the Admin guide for how to perform a backend promotion, and
updates to the devref for driver developers to explain the expectations of
drivers implementing replication.

References
==========

None