Spec for Cheesecake approach to replication
Scale back a bit again; nail this one before M, since nothing has released on the old version, and then move on to Tiramisu. Change-Id: I9fc6870ea657906e9ff35b3134e6b61bf69a4193
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Cheesecake
==========================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/cinder/+spec/replication

This spec proposes a further refinement to Cinder replication. After more
vendors have tried to implement replication, and we've learned more lessons
about the differences in backends and their semantics, we've decided we
should step back and look at simplifying this even further.

The goal of the new design is to address a large amount of confusion and
differences in interpretation. Rather than try to cover multiple use cases in
the first iteration, this spec aims to address a single, fairly well-defined
use case. Then we can iterate and move on from there.

Problem description
===================

The existing design is great for some backends, but is challenging for many
devices to fit into. It's also filled with pitfalls around the question of
managed vs. unmanaged targets, not to mention trying to deal with failing
over some volumes and leaving others. The concept of failing over on a
per-volume basis instead of a per-device basis, while nice for testing,
doesn't fit well into the intended use case, results in quite a bit of
complexity, and is not something that a number of backends can even support.

Use Cases
=========

This is intended to be a DR mechanism. The model use case is a catastrophic
event occurring on the backend storage device; some or all volumes that were
on the primary backend may have been replicated to another backend device, in
which case those volumes may still be accessible.

The flow of events is as follows:

1. Admin configures a backend device to enable replication. We have a
   configured cinder backend just as always (Backend-A), but we add config
   options for a replication target (Backend-B).

   a. We no longer deal with the differentiation between managed and
      unmanaged. Now, to enable replication target(s), the replication_target
      entry is the ONLY method allowed, and it is specified as a section in
      the driver's configuration.

   b. Depending on the backend device, enabling this may mean that EVERY
      volume created on the device is replicated; or, for backends that have
      the capability and if admins choose to do so, a Volume-Type of
      "replicated=True" can be created and used by tenants.

   Note that if the backend only supports replicating "all" volumes, or if
   the Admin wants to set things up so that "all" volumes are replicated,
   the Type creation may or may not be necessary.

2. Tenant creates a Volume that is replicated (either by specifying the
   appropriate Type, or by the nature of the backend device). The result in
   this example is a Volume we'll call "Foo".

3. Backend-A is caught in the crossfire of a water balloon fight that
   shouldn't have been taking place in the data center, and loses its magic
   smoke. "It's dead, Jim!"

4. Admin issues the "cinder replication-failover" command with possible
   arguments.

   a. The call propagates to the Cinder driver, which performs the
      appropriate steps for that driver to now point to the secondary
      (target) device (Backend-B).

   b. The Service table in Cinder's database is updated to indicate that a
      replication failover event has occurred, and that the driver is
      currently pointing to an alternate target device.

   In this scenario, volumes that were replicated should still be accessible
   by tenants. Their usage may or may not be restricted depending on options
   provided in the failover command. If no restrictions are set, we expect to
   be able to continue using them as we would prior to the failure event.

   Volumes that were attached/in-use are a special case in this scenario and
   will require additional steps. The Tenant will be required in this case to
   detach the volumes from any instances manually. Cinder does not have the
   ability to call Nova's volume-detach methods, so this has to be done by
   the Tenant or the Admin.

   c. Freeze option provided as an argument to failover.
      The failover command includes a "freeze" option. This option indicates
      that a volume may still be read or written to, HOWEVER we will not
      allow any additional resource create or delete operations until an
      admin issues a "thaw" command. This means that attempts to call
      snapshot-create, xxx-delete, resize, retype, etc. should return an
      InvalidCommand error. This is intended to keep things in as stable a
      state as possible, to help in recovering from the catastrophic event.
      We think of this as the backend resources becoming read-only from a
      management/control plane perspective. This does not mean you can't do
      R/W IO from an instance to the volume. (A minimal sketch of this
      control-plane check follows this list.)

5. How to get back to "normal".

   a. If the original backend device is salvageable, the failover command
      should be used to switch back to the original primary device. This of
      course means that there should be some mechanism on the backend, and
      operations performed by the Admin, that ensure the resources still
      exist on the Primary (Backend-A) and that their data is updated based
      on what may have been written while they were hosted on Backend-B.
      This indicates that for backends to support this, something like
      two-way replication is going to be required. For backends that can't
      support this, it's likely that we'll need to instead swap the primary
      and secondary configuration info (reconfigure, making Backend-B the
      Primary).

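The following is a minimal, hypothetical sketch of the freeze/thaw
control-plane check described in step 4c. The names used here
("check_frozen", "FROZEN_BLOCKED_ACTIONS", "InvalidCommand") are placeholders
and not part of the proposed API; the real check would live in Cinder's
API/manager layers and use whatever state field the data model work settles
on::

    # Hypothetical sketch only; names are placeholders, not the final
    # Cinder implementation.
    FROZEN_BLOCKED_ACTIONS = {
        'snapshot_create', 'snapshot_delete', 'volume_delete',
        'volume_extend', 'volume_retype',
    }


    class InvalidCommand(Exception):
        """A management operation was attempted on a frozen backend."""


    def check_frozen(service, action):
        """Reject control-plane operations while the backend is frozen.

        Normal guest I/O to already attached volumes is unaffected; only
        resource create/delete/modify style calls are blocked until the
        admin issues "replication-thaw".
        """
        if service.get('frozen') and action in FROZEN_BLOCKED_ACTIONS:
            raise InvalidCommand(
                "Backend %s is frozen after a replication failover; '%s' "
                "is not allowed until replication-thaw is issued."
                % (service['host'], action))
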
It's important to emphasize that if a volume is not of type "replicated", it
will NOT be accessible after the failover. This approach fails over the
entire backend to another device.

Proposed change
===============

One of the goals of this change is to try to eliminate some of the challenges
with the differences between managed and unmanaged replication targets. In
this model we make this easier for backends. Rather than having some volumes
on one backend and some on another, and not doing things like stats updates,
we now fail over the entire backend, including stats updates and everything
else.

This does mean that non-replicated volumes will be left behind and
inaccessible (unavailable); that's an expectation in this use case (the
device burst into flames). We should treat these volumes just like we
currently treat volumes in a scenario where somebody disconnects a backend.
That's essentially what is happening here, and it's no different really.

For simplicity in the first iteration, we're specifying the device as a
driver parameter in the config file, and we're not trying to just read in a
secondary configured backend device::

    [driver-foo]
    volume_driver=xxxx
    valid_replication_devices='remote_device={some unique access meta}',...

NOTE that the remote_device access MUST be handled via the configured driver.

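As a rough illustration of how a backend driver might consume such an option,
the sketch below assumes an oslo.config list option named
"valid_replication_devices"; the option name, its format, and the driver
class shown ("FooDriver") are placeholders rather than a settled interface::

    # Hypothetical sketch: how a driver could pick up its replication
    # target(s) from its own config section on init.
    from oslo_config import cfg

    replication_opts = [
        cfg.ListOpt('valid_replication_devices',
                    default=[],
                    help='Replication target(s) for this backend, given as '
                         'driver-specific access metadata.'),
    ]


    class FooDriver(object):
        def __init__(self, configuration):
            # 'configuration' represents the backend section ([driver-foo]).
            configuration.append_config_values(replication_opts)
            self.replication_targets = configuration.valid_replication_devices
            # If nothing is configured, replication simply isn't enabled for
            # this backend; only the driver knows how to reach the remote
            # device(s) described by the metadata.
            self.replication_enabled = bool(self.replication_targets)
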
* Add the following API calls:

  replication-enable/disable 'backend-name'
    This will issue a command to the backend to update the capabilities
    being reported for replication.

  replication-failover [--freeze] 'backend-name'
    This triggers the failover event, assuming that the current primary
    backend is no longer accessible.

  replication-thaw 'backend-name'
    Thaw a backend that experienced a failover and is frozen.

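These calls would ultimately be dispatched down to the driver. A hypothetical
sketch of the driver-facing hooks is below; the method names and signatures
are illustrative only and are not being proposed as the final contract::

    # Hypothetical driver-facing hooks the new API calls might map to.
    class ReplicationDriverMixin(object):

        def replication_enable(self, context):
            """Start (or resume) replicating volumes to the target(s)."""
            raise NotImplementedError()

        def replication_disable(self, context):
            """Stop replicating new writes to the target(s)."""
            raise NotImplementedError()

        def replication_failover(self, context, freeze=False):
            """Point this backend at its secondary (target) device.

            Returns an identifier for the target now acting as primary so
            it can be persisted and re-used when the service restarts.
            """
            raise NotImplementedError()

        def replication_thaw(self, context):
            """Allow create/delete style operations again after a freeze."""
            raise NotImplementedError()
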
Special considerations
-----------------------

* async vs sync

  This spec does not make any assumptions about what replication method the
  backend uses, nor does it care.

* transport

  Implementation details and *how* the backend performs replication are
  completely up to the backend. The requirement is that the interfaces and
  end results are consistent.

* The volume driver for the replicated backend MUST have the ability to
  communicate with the other backend and route calls correctly based on
  what's selected as the current primary. One example of an important detail
  here is the "update stats" call (see the sketch after this list).

  In the case of a failover, it is expected that the secondary/target device
  is now reporting stats/capabilities, NOT the now *dead* backend.

* Tenant visibility

  The visibility by tenants is LIMITED!!! In other words, the tenant should
  know very little about what's going on. The only information that should
  really be propagated is that the backend and the volume are in a
  "failed-over" state, and whether they're "frozen".

  In the case of a failover where volumes are no longer available on the new
  backend, the driver should raise a NotFound exception for any API calls
  that attempt to access them.

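A minimal sketch of what that routing could look like for the stats call,
assuming hypothetical driver attributes ("failed_over", "active_backend",
"primary_backend") and a private helper "_get_stats_from" that each driver
would implement in its own way::

    # Hypothetical sketch: report capabilities for whichever device is
    # currently acting as primary, never the dead original.
    def get_volume_stats(self, refresh=False):
        if self.failed_over:
            # After failover, Backend-B answers the scheduler's questions.
            return self._get_stats_from(self.active_backend)
        return self._get_stats_from(self.primary_backend)
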
Alternatives
------------

There are all sorts of alternatives, the most obvious of which is to leave
the implementation we have and iron it out. Maybe that's good, maybe that's
not. In my opinion this approach is simpler, easier to maintain, and more
flexible; otherwise I wouldn't propose it. Given that only one vendor has
implemented replication in the existing setup, and they currently have a
number of open issues, we're not causing a terrible amount of churn or
disturbance if we move forward with this now.

The result will be something that should be easier to implement and, as an
option, will have less impact on the core code.

One appealing option would be to leave Cinder more cloud-like and not even
offer replication.

Data model impact
-----------------

We'll need a new column in the host table that indicates "failed-over" and
"frozen" status.

We'll also need a new property for volumes, indicating whether they're
failed-over and whether they're frozen or not.

Finally, to plan for cases where a backend has multiple replication targets,
we need to provide a mechanism for the driver to persist some ID info about
where the failover was sent. In other words, make sure the driver has a way
to set things back up correctly on init.

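A hypothetical sketch of what those fields might look like, shown here as
columns on the services table; the column names and their exact placement
(services table vs. per-volume fields) are illustrative and still to be
settled::

    # Hypothetical sketch of the new replication state fields.
    from sqlalchemy import Boolean, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    BASE = declarative_base()


    class Service(BASE):
        __tablename__ = 'services'
        id = Column(Integer, primary_key=True)

        # Proposed additions (names illustrative):
        failed_over = Column(Boolean, default=False)
        frozen = Column(Boolean, default=False)
        # Persist which replication target we failed over to, so the
        # driver can re-initialize correctly after a service restart.
        active_replication_target = Column(String(255), nullable=True)
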
REST API impact
---------------

replication-enable/disable 'backend-name'
  This will issue a command to the backend to update the capabilities being
  reported for replication.

replication-failover [--freeze] 'backend-name'
  This triggers the failover event, assuming that the current primary
  backend is no longer accessible.

replication-thaw 'backend-name'
  Thaw a backend that experienced a failover and is frozen.

Security impact
---------------

Describe any potential security impact on the system. Some of the items to
consider include:

* Does this change touch sensitive data such as tokens, keys, or user data?

  Nope

* Does this change alter the API in a way that may impact security, such as
  a new way to access sensitive information or a new way to login?

  Nope, not that I know of

* Does this change involve cryptography or hashing?

  Nope, not that I know of

* Does this change require the use of sudo or any elevated privileges?

  Nope, not that I know of

* Does this change involve using or parsing user-provided data? This could
  be directly at the API level or indirectly such as changes to a cache
  layer.

  Nope, not that I know of

* Can this change enable a resource exhaustion attack, such as allowing a
  single API interaction to consume significant server resources? Some
  examples of this include launching subprocesses for each connection, or
  entity expansion attacks in XML.

  Nope, not that I know of

For more detailed guidance, please see the OpenStack Security Guidelines as
a reference (https://wiki.openstack.org/wiki/Security/Guidelines). These
guidelines are a work in progress and are designed to help you identify
security best practices. For further information, feel free to reach out
to the OpenStack Security Group at openstack-security@lists.openstack.org.

Notifications impact
--------------------

We'd certainly want to add a notification event for "failed over", as well as
freeze/thaw and enable/disable events.

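A rough sketch of what emitting such an event could look like, assuming an
oslo.messaging-style notifier; the event type and payload fields shown are
placeholders::

    # Hypothetical sketch: emit a notification when a failover occurs.
    def notify_failover(notifier, context, host, target, frozen):
        payload = {
            'host': host,
            'active_replication_target': target,
            'frozen': frozen,
        }
        # 'replication.failover' is an illustrative event type, not final.
        notifier.info(context, 'replication.failover', payload)
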
Other end user impact
---------------------

Aside from the API, are there other ways a user will interact with this
feature?

* Does this change have an impact on python-cinderclient? What does the user
  interface there look like?

  TBD

Performance Impact
------------------

Describe any potential performance impact on the system, for example
how often will new code be called, and is there a major change to the
calling pattern of existing code.

Examples of things to consider here include:

* A periodic task might look like a small addition but when considering
  large scale deployments the proposed call may in fact be performed on
  hundreds of nodes.

* Scheduler filters get called once per host for every volume being created,
  so any latency they introduce is linear with the size of the system.

* A small change in a utility function or a commonly used decorator can have
  a large impact on performance.

* Calls which result in database queries can have a profound impact on
  performance, especially in critical sections of code.

* Will the change include any locking, and if so what considerations are
  there on holding the lock?

Other deployer impact
---------------------

Discuss things that will affect how you deploy and configure OpenStack
that have not already been mentioned, such as:

* What config options are being added? Should they be more generic than
  proposed (for example a flag that other volume drivers might want to
  implement as well)? Are the default values ones which will work well in
  real deployments?

* Is this a change that takes immediate effect after it's merged, or is it
  something that has to be explicitly enabled?

* If this change is a new binary, how would it be deployed?

* Please state anything that those doing continuous deployment, or those
  upgrading from the previous release, need to be aware of. Also describe
  any plans to deprecate configuration values or features. For example, if
  we change the directory name that targets (LVM) are stored in, how do we
  handle any used directories created before the change landed? Do we move
  them? Do we have a special case in the code? Do we assume that the
  operator will recreate all the volumes in their cloud?

Developer impact
----------------

Discuss things that will affect other developers working on OpenStack,
such as:

* If the blueprint proposes a change to the driver API, discussion of how
  other volume drivers would implement the feature is required.

Implementation
==============

Assignee(s)
-----------

Who is leading the writing of the code? Or is this a blueprint where you're
throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate
the primary author and contact.

Primary assignee:
  john-griffith

Other contributors:
  <launchpad-id or None>

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.

Dependencies
============

* Include specific references to specs and/or blueprints in cinder, or in
  other projects, that this one either depends on or is related to.

* If this requires functionality of another project that is not currently
  used by Cinder (such as the glance v2 API when we previously only required
  v1), document that fact.

* Does this feature require any new library dependencies or code otherwise
  not included in OpenStack? Or does it depend on a specific version of a
  library?

* Need Horizon support

Testing
=======

Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc.)?

Documentation Impact
====================

What is the impact on the docs team of this change? Some changes might
require donating resources to the docs team to have the documentation
updated. Don't repeat details discussed above, but please reference them
here.

Obviously this is going to need docs and devref info in the cinder docs tree.

References
==========

Please add any useful references here. You are not required to have any
references. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions

* Links to notes from a summit session

* Links to relevant research, if appropriate

* Related specifications as appropriate (e.g. link to any vendor
  documentation)

* Anything else you feel it is worthwhile to refer to

The specs process is a bit much; we should revisit it. It's rather bloated,
and while the first few sections are fantastic for requiring thought and
planning, towards the end it just gets silly.