Merge "Update cheesecake promotion specification"
This commit is contained in:
commit
2f5701b912
@ -1,161 +0,0 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Promote Secondary Backend After Failover (Cheesecake)
=====================================================

https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion

Problem description
===================

After failing backend A over to backend B, there is no mechanism in
Cinder to promote backend B to the master backend in order to then replicate
to a backend C.

Current Workflow
----------------

1. Setup Replication
2. Failure Occurs
3. Fail over
4. Promote Secondary Backend

   a. Freeze backend to prevent manage operations
   b. Stop cinder volume service
   c. Update cinder.conf to have backend A replaced with B and B with C
   d. *Hack db to set backend to no longer be in 'failed-over' state*

      * This is the step this spec is concerned with
      * Example:

        ::

            update services set disabled=0,
                disabled_reason=NULL,
                replication_status='enabled',
                active_backend_id=NULL
            where id=3;

   e. Start cinder volume service
   f. Unfreeze backend

Use Cases
=========

There was a fire in my data center and my primary backend (A) was destroyed.
Luckily, I was replicating that backend to backend (B). After failing over
to backend B and repairing the data center, we installed backend C to be a
new replication target for B.

Proposed change
===============

Add the following commands to reset the active backend for a host.

::

    cinder reset-active-backend <backend-name>
    PUT /os-services/reset_active_backend {"host": <backend-name>}

Equivalent to:

::

    update services set disabled=0,
        disabled_reason=NULL,
        replication_status='disabled',
        active_backend_id=NULL
    where host='<backend-name>';

Add the following to re-enable replication from backend B -> C.

::

    cinder replication-enable <backend-name>
    PUT /os-services/replication_enable {"host": <backend-name>}

Equivalent to:

::

    update services set replication_status='enabled'
    where host='<backend-name>';

Alternatives
------------

Instead of 'admin' APIs, we add cinder-manage commands.

::

    cinder-manage reset-active-backend <backend-name>
    cinder-manage replication-enable <backend-name>

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  ?

Work Items
----------

* Implement cinder reset-active-backend API
* Implement cinder replication-enable API
* Document post-fail-over recovery in the Admin guide

Dependencies
============

None

Testing
=======

None

Documentation Impact
====================

Documentation in the Admin guide for how to perform a backend promotion.

References
==========

None

specs/queens/cheesecake-promote-backend.rst (new file, 235 lines)

@@ -0,0 +1,235 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================================
Promote Replication Target (Cheesecake)
========================================

https://blueprints.launchpad.net/cinder/+spec/replication-backend-promotion

Problem description
===================

After failing backend A over to backend B, there is no mechanism in
Cinder to promote backend B to the master backend in order to then replicate
to a backend C. We also lack some management commands to help rebuild after a
disaster if states get out of sync.

Current Workflow
----------------

1. Setup Replication
2. Failure Occurs
3. Fail over
4. Promote Secondary Backend

   a. Freeze backend to prevent manage operations
   b. Stop cinder volume service
   c. Update cinder.conf to have backend A replaced with B and B with C
   d. *Hack db to set backend to no longer be in 'failed-over' state*

      * This is the step this spec is concerned with
      * Example:

        ::

            update services set disabled=0,
                disabled_reason=NULL,
                replication_status='enabled',
                active_backend_id=NULL
            where id=3;

   e. Start cinder volume service
   f. Unfreeze backend

Use Cases
=========

There was a fire in my data center and my primary backend (A) was destroyed.
Luckily, I was replicating that backend to backend (B). After failing over
to backend B and repairing the data center, we installed backend C to be a
new replication target for B.

There is also the case where, for whatever reason, the replication state gets
out of sync with reality, and the status and active backend id need to be
adjusted manually by the cloud admin while recovering from a disaster.

Proposed change
===============

To handle the case where the Cinder DB just needs to be synchronized with the
real world, we will add the following cinder-manage command to reset the
active backend for a host. Similar to reset-state for volumes, this will just
do DB operations. The assumption is that cinder.conf has already been updated
and that the volume service is stopped, in a disabled state, and probably
frozen. The manage command will verify that the service is stopped, disabled,
and frozen.

::

    cinder-manage reset-active-backend replication_status=<status> \
        <active_backend_id> <backend-host>

Equivalent to:

::

    update services set replication_status='<status>',
        active_backend_id='<id>'
    where host='<backend-host>';

Where the default for `status` will be disabled and `active_backend_id` will
default to None. The target for this could also be a cluster in A-A
deployments.

Note: It will be up to the Admin to re-enable the service.

That gives us the ability to avoid having an admin manually run DB commands,
but it does *NOT* allow for doing this in an "online" recovery. The volume
service must be offline for this to work safely.
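
As an illustration only, the command's logic might look roughly like the
sketch below; this is not the final implementation, and the service lookup
helper and the `frozen`/`disabled`/`is_up` checks are assumptions drawn from
the workflow above.

::

    from cinder import context
    from cinder import objects

    def reset_active_backend(replication_status, active_backend_id,
                             backend_host):
        # Hypothetical sketch: load the service row, refuse to touch a
        # live service, then apply the same update as the SQL above.
        ctxt = context.get_admin_context()
        service = objects.Service.get_by_args(ctxt, backend_host,
                                              'cinder-volume')
        if service.is_up or not service.disabled or not service.frozen:
            raise SystemExit('Service must be stopped, disabled, and frozen.')
        service.replication_status = replication_status or 'disabled'
        service.active_backend_id = active_backend_id or None
        service.save()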

Making this change will require drivers to make adjustments to their
replication states upon initialization. As-is, we don't really have much
definition around what is allowed to change in cinder.conf and how much the
driver should support for changes to replication targets and the like. There
is a sort of implied contract that when `__init__` or `do_setup` is called
the driver *should* make its current replication state match what it is
given in the config. This is a little problematic today, though, as there
isn't a mechanism for drivers to update DB entries like
`replication_extended_status` or `replication_driver_data`. To close that
gap, and to allow for this new officially supported method of modifying
replication status while offline, we will introduce a new driver API method:

::

    update_host_replication(replication_status, active_backend_id, volumes, groups)

This method will be called immediately following `do_setup` and will return a
model update for the service as well as a list of updates for the volumes and
groups if the driver supports replication groups. The `groups` parameter will
default to None if not defined.

Drivers will need to take appropriate steps to get the system into the
desired state based on the current DB state (which was theoretically modified
before startup by the new cinder-manage command) and the current cinder.conf.
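
As a rough sketch, assuming the hook returns the service update plus lists of
volume and group updates as described above, a driver might implement it
along these lines; the `_sync_replication_targets` helper is hypothetical:

::

    def update_host_replication(self, replication_status, active_backend_id,
                                volumes, groups=None):
        # Hypothetical driver sketch: reconcile the backend's real
        # replication state with the (possibly hand-reset) DB state and
        # the targets currently listed in cinder.conf.
        self._sync_replication_targets(active_backend_id)

        service_update = {'replication_status': replication_status,
                          'active_backend_id': active_backend_id}
        volume_updates = [
            {'volume_id': volume.id,
             'updates': {'replication_status': replication_status}}
            for volume in volumes]
        group_updates = [
            {'group_id': group.id,
             'updates': {'replication_status': replication_status}}
            for group in groups or []]
        return service_update, volume_updates, group_updates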

If not implemented, things will continue to work as they do today, and the
admin may potentially have to take more drastic measures to recover after
performing a failover. The goal here is not to break any existing
functionality, but to add enough infrastructure for drivers to behave better.

When we do this we might require some way of fencing to prevent multiple A-A
driver instances from doing the setup at the same time, as that will more
than likely be problematic for some backends and could risk data loss if more
than one is modifying replication states at the same time. As a simple
solution we could use a new replication_status like `updating` for the
service, and only allow a driver to call `update_host_replication` if the
status is not already set to that. This status can also be beneficial for an
admin to know what is happening if the update via the driver takes a
noticeable amount of time.
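
A minimal sketch of that fencing, assuming the service object supports a
compare-and-swap style `conditional_update` (as Cinder's versioned objects
do) and that a `db.Not` filter is available for negated expectations:

::

    # Hypothetical fencing sketch: only the instance that wins the swap
    # to 'updating' runs the driver hook; losers skip the setup entirely.
    desired_status = service.replication_status
    won = service.conditional_update(
        {'replication_status': 'updating'},
        expected_values={'replication_status': db.Not('updating')})
    if won:
        try:
            driver.update_host_replication(
                desired_status, service.active_backend_id, volumes, groups)
        finally:
            # Restore the real status so other instances are unblocked.
            service.replication_status = desired_status
            service.save()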

Doing it this way should also allow for later making "online" updates, where
we can utilize the same driver hook to modify replication states. This spec
and the initial implementation do not aim to cover that scenario.

Alternatives
------------

We could add admin APIs to Cinder. Those APIs could do the DB updates and
ping the drivers. The downside is that this requires the API and volume
services to be online, which may be problematic in the scenario where you
are picking up the pieces after a disaster.

Later on we can look into doing "online" promotions where the volume service
does not need to be offline. Similar code in the drivers would be required,
but the complexity increases rapidly when trying to support this.

There was also discussion about using new admin APIs which would modify a DB
state that tracks replication info. The downside to this is that we would
move into a scenario where the running state doesn't match the configured
state.

Following on that path of tracking replication state in the DB, we could go
to the extreme and move all of the replication configuration to be done via
APIs. We could then track state and provide drivers with diffs as the state
changes. In the longer term that addresses the runtime vs. config state
disparity, but it would be a significant change in workflow and deployments,
not to mention it would require somewhat major changes to drivers
implementing replication.

Data model impact
-----------------

A new status for the service/cluster will be added.

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Volume service startup will probably take a performance hit, depending on
the backend and how many replicated volumes need to be modified and updated.

Other deployer impact
---------------------

None

Developer impact
----------------

Driver maintainers will potentially need to implement this new functionality,
and be aware of the implications of how and when replication configuration
and status can be adjusted.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  jbernard

Work Items
----------

* Implement the cinder-manage reset-active-backend command
* Implement volume manager changes to allow `update_host_replication` to be
  used at startup by drivers.
* Open a bug against each backend that supports replication and needs an
  update as a result of this change.

Dependencies
============

None

Testing
=======

None

Documentation Impact
====================

Documentation in the Admin guide for how to perform a backend promotion, and
updates to the devref for driver developers to explain the expectations of
drivers implementing replication.

References
==========

None