..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Volume state enforcer
=====================

https://blueprints.launchpad.net/cinder/+spec/cinder-state-enforcer

Concurrent resource access in cinder is a problem that has caused resource
corruption when the same resource is mutated simultaneously by multiple cinder
entrypoints (api and manager for example). In Icehouse, locks were added around
manager functions to queue up requests when a resource is being worked on
simultaneously by multiple functions (this stops one of those operations from
concurrently mutating the underlying resource). Sadly this is more of a
*sledgehammer* approach: it hides the symptoms of the problem and makes it
non-obvious when debugging what other requests are queued up behind the lock
(or why deadlocking is occurring, if and when it does).

To help alleviate and hopefully solve this problem we will try to attack some
of these issues in a different manner: integrating an *allowed* state
transition table into the ``create_volume`` workflow, doing *strategic* state
transitions, and aborting/erroring out when a state transition is not
allowed. In the future this will help create a concrete set of well-defined
states and transitions for other workflows as well (and will make it clear,
while reading code and during debugging, which transitions are allowed at the
same time and which transitions are actively occurring).

Problem description
===================

A high-level description of the problem:

* Concurrent resource mutation, bad (EOM).

More detailed description:

* Locks in cinder are being added to protect against simultaneous resource
  modification. For example, in ``create_volume``, ``attach_volume``,
  ``delete_volume``, ``detach_volume`` and so on, an external lock is acquired
  in the manager with name ``volume_id, f.__name__``. This has helped make the
  manager safer under concurrent resource access, but it was only ever meant
  to be a temporary solution to a wider problem. One of the issues with this
  mechanism is that it does not use a `DLM`_ (distributed lock manager) but
  only a local filesystem lock. This means that a cinder-api service can
  mutate the resource (or initiate a request to do this) while a second
  mutation is actively in flight. When a single *master* manager is running
  this works out (since one of the in-flight requests will back up behind the
  external lock), yet this is an atypical deployment pattern and should not be
  recommended as the way to deploy and run cinder (it should be horizontally
  scalable so that there can be X active managers, where X is > 1). We need
  some other type of solution that scales horizontally but still achieves the
  same end goal (disallowing simultaneous resource mutation by X entities at
  the same time).

Since the scope of this problem is bigger (it applies to all/most operations
that act on resources) we have to start somewhere, so we will start by working
through how this will look for the ``create_volume`` workflow. This does raise
the larger question of how the change can be made in a *piecemeal* fashion,
since the other operations will still be lock dependent, and mixing state
transitions with lock acquisition will likely not end in a correct solution.
We will have to explore how to do this in a way that is *piecemeal* but does
not destabilize cinder further.

Use Cases
=========

Proposed change
===============

Instead of acquiring local filesystem locks in the manager processes, refactor
the concept of a lock to instead be a set of allowed and disallowed state
transitions (which is conceptually similar to the internal mechanism that a
lock uses anyway).

Let's take an abbreviated example of how this could work:

When a volume is requested to be created, a database record is created for
this volume. This record contains a field called ``status`` (in fact there
exist multiple such status fields; in the future these should maybe be
removed?) that is used to report back to the user the status of the create
volume request as it moves through the various components in cinder (api,
scheduler, and manager).

This status has an expected transition diagram and is a starting point for
determining the larger set of state transitions that a cinder volume create
request goes through (and is allowed to go through). Instead of overriding
this ``status`` field, this spec proposes to augment the data storage layer
in cinder with a new ``resource_states`` table. It may be represented by
something other than a table depending on where this data is stored (if
`zookeeper`_ was used it would be represented as a resource tree). The only
constraint that we *must* enforce is that we can fetch and update the state of
a resource in a single atomic operation.

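As a rough illustration of that atomicity constraint, the sketch below uses
a compare-and-swap style ``UPDATE`` against an in-memory SQLite database. The
``current_state`` table (one row per resource), the ``register`` helper and
the ``compare_and_swap`` function are hypothetical names chosen for this
example and are not part of cinder; a real implementation could look quite
different.

```python
import sqlite3


def open_store():
    """Create an in-memory store holding one current-state row per resource."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE current_state ("
        " resource TEXT PRIMARY KEY,"
        " state TEXT,"  # NULL means no operation is in progress
        " transitioned_on TEXT)"
    )
    return conn


def register(conn, resource):
    """Add a resource with a NULL state (hypothetical helper)."""
    with conn:
        conn.execute(
            "INSERT INTO current_state (resource, state) VALUES (?, NULL)",
            (resource,),
        )


def compare_and_swap(conn, resource, expected, new, when):
    """Move resource from expected to new in one statement.

    Returns True only if the row was still in the expected state, so the
    check and the update cannot be separated by another writer.
    """
    with conn:
        cur = conn.execute(
            "UPDATE current_state SET state = ?, transitioned_on = ?"
            " WHERE resource = ? AND state IS ?",  # IS also matches NULL
            (new, when, resource, expected),
        )
    return cur.rowcount == 1
```

A caller that gets ``False`` back knows some other request changed the state
first and can reject its own operation instead of waiting on a lock.
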
A potential schema could look like the following:

+--------------------------------------+----------------+---------------------+
| **Resource**                         | **State**      | **Transitioned on** |
+======================================+================+=====================+
| 7c92ee46-7a2e-4183-99c5-909f3d46a90e | CREATING_DB    | 2014-05-22T15       |
+--------------------------------------+----------------+---------------------+
| 7c92ee46-7a2e-4183-99c5-909f3d46a90e | SCHEDULING     | 2014-05-23T15       |
+--------------------------------------+----------------+---------------------+
| 7c92ee46-7a2e-4183-99c5-909f3d46a90e | CREATING_VOL   | 2014-05-23T15       |
+--------------------------------------+----------------+---------------------+
| 7c92ee46-7a2e-4183-99c5-909f3d46a90e | NULL/None      | 2014-05-24T15       |
+--------------------------------------+----------------+---------------------+

This table structure (with ``NULL`` states delimiting when a request has
fulfilled its set of allowed state transitions) will then be used at the API
level, before a request has been accepted, to determine what a resource is
currently being used for. The API server can then attempt to initiate a
transition to a desired state (for example, ``DETACHING``). Depending on
whether this transition is allowed (judged by the last known state) the
attempt may fail or succeed. If it succeeds, the API server continues with the
rest of the workflow for the desired operation (subsequent transitions will
also be made in the rest of the workflow, as needed, with the final transition
being a transition to ``NULL/None`` to denote that the operation has
completed). If the transition is disallowed/fails, the API request will be
denied and the operation will not be allowed to make forward progress (in the
future this model can be relaxed to allow for simultaneous state transitions
for operations where this makes sense).

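The allow/deny decision described above might be sketched as follows. The
contents of the ``ALLOWED`` table are an assumption made for illustration
(the real set of states is still to be decided, see the work items), and the
``StateEnforcer`` and ``TransitionDenied`` names are hypothetical. The
in-process lock only stands in for whatever atomic operation the real store
provides.

```python
import threading

# Hypothetical transition table: current state -> states that may follow it.
# None stands for the NULL state (no operation in progress).
ALLOWED = {
    None: {"CREATING_DB", "DETACHING"},
    "CREATING_DB": {"SCHEDULING"},
    "SCHEDULING": {"CREATING_VOL"},
    "CREATING_VOL": {None},
    "DETACHING": {None},
}


class TransitionDenied(Exception):
    """Raised when a requested transition is not in the allowed table."""


class StateEnforcer:
    def __init__(self, allowed=ALLOWED):
        self._allowed = allowed
        self._states = {}  # resource id -> last known state
        self._lock = threading.Lock()  # stand-in for the atomic store update

    def current(self, resource):
        with self._lock:
            return self._states.get(resource)

    def transition(self, resource, new_state):
        """Check the last known state and update it, or raise."""
        with self._lock:
            current = self._states.get(resource)
            if new_state not in self._allowed.get(current, set()):
                raise TransitionDenied(
                    f"{resource}: {current} -> {new_state} is not allowed"
                )
            self._states[resource] = new_state
```

With this table, a request to move a volume to ``DETACHING`` while it is in
``CREATING_DB`` raises ``TransitionDenied``, which the API layer could turn
into an error (or try again later) response.
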
To accomplish this we can build on the fact that the ``create_volume``
operation already uses taskflow, which has helped decompose the workflows that
volume creation goes through (it also makes it possible to resume from a prior
state if the process crashes). This decomposition makes it obvious (or more
obvious) where the transitions should occur and what the transitions are. The
proposed path is to add new nodes into the workflow that will perform and
validate these state transitions (attempting to mutate the above resource
state table) at a level granular enough to be useful and meaningful (the
transition table can also be useful for operators and developers attempting to
determine what is happening inside cinder). When this is combined with
information from taskflow about its own internal `states`_ (via
`notifications`_), deciphering what is going on inside cinder becomes much
easier, which provides invaluable information to users, developers and
operators of cinder.

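The idea of interleaving transition nodes with the existing work can be
sketched without taskflow itself. The ``run_with_transitions`` helper below is
hypothetical and deliberately plain Python; it does not use the taskflow API,
and it leaves out failure handling because this spec does not yet define
which state a failed step should leave behind.

```python
def run_with_transitions(resource, steps, transition):
    """Run steps in order, making a state transition before each one.

    steps is a list of (state, callable) pairs. transition(resource, state)
    is expected to raise if the move is not allowed, which aborts the run
    before the corresponding step does any work.
    """
    for state, step in steps:
        transition(resource, state)
        step(resource)
    # The final transition to None marks the operation as complete.
    transition(resource, None)
```

Here ``transition`` would be backed by the resource state store, and each
``step`` would correspond to one of the existing pieces of the create volume
workflow.
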
Alternatives
------------

One possibility for avoiding the above ``resource_states`` table is to use
a `DLM`_ with an approach similar to the one currently used with file locks in
cinder. The usage would be similar to the usage of file locks, although
there are scenarios at RPC boundaries where it would still require state
transition validation. For example, when a lock is released and an async RPC
call is made, it becomes possible for other async RPC calls to be active at
the same time, so both a state transition and a lock system would be needed
when the receivers of those RPC calls accept and perform the requested RPC
operation.

Another possible solution that does not require state transitions is to use
sync RPC calls instead of async RPC calls. The sender would only release the
`DLM`_ lock it owns after it has received confirmation that the receiver has
started to process (or has accepted) the request. The receiver would acquire
the lock when it accepts the request, ensuring that a correct lock hand-off
happens between the sender and receiver. This would require a sensitive and
hard-to-get-correct lock hand-off code path and process (this path would need
to be tested heavily to ensure correctness).

IMHO both of these alternative methods are too fragile and do not make the
state transition process and diagram obvious to developers, operators, and
users. This lack of information impedes cinder adoption, and makes it more
difficult to recover from (and understand) inevitable failures and
operational issues.

.. _DLM: http://en.wikipedia.org/wiki/Distributed_lock_manager

What this does not solve
------------------------

I would also like to include a note on what the scope of this specification
does **not** encompass.

* It does **not** encompass cross-project resource usage and
  inconsistencies related to state transitions being done by a project using
  cinder (for example the initiation of a detach of a volume by nova will not
  be aborted early in the nova API flow, but instead will be aborted later in
  the workflow if cinder is performing other state transitions on that
  resource).
* It also does **not** stop cinder from deleting a volume underneath
  nova (aka a VM can be using a volume while cinder is deleting it).

These are larger cross-project consistency issues and will need to be solved
at a higher level across the projects. It should be noted that once a project
itself has a consistent set of states and transitions it becomes *much* easier
to make cross-project consistency possible (without **internal** consistency,
cross-project resource usage might as well be discouraged/avoided).

Data model impact
-----------------

See the above proposed table.

Cross-project impact
--------------------

We **must** be careful to retain the existing API so that nova, which is
dependent on cinder's currently visible states, continues to work. This just
means that we need to have an exposed mapping that nova is compatible with,
while we have an internal mapping which is much more detailed and consistent.

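One way to picture the two mappings is a small lookup from detailed internal
states to the coarser statuses that are already visible. The specific pairs
below, the ``EXPOSED_STATUS`` name and the ``exposed_status`` helper are
assumptions for illustration only; the actual mapping would come out of the
state diagram work.

```python
# Hypothetical mapping from detailed internal states to the coarser statuses
# that API consumers such as nova already see.
EXPOSED_STATUS = {
    "CREATING_DB": "creating",
    "SCHEDULING": "creating",
    "CREATING_VOL": "creating",
    "DETACHING": "detaching",
}


def exposed_status(internal_state, stored_status):
    """Return the status to report externally.

    When no operation is in progress (internal_state is None), or the
    internal state has no mapping, fall back to the stored status field.
    """
    return EXPOSED_STATUS.get(internal_state, stored_status)
```

Several internal states collapse onto one exposed status here, which is what
lets the internal table become more detailed without changing what nova sees.
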
REST API impact
---------------

Maybe in the future.

Security impact
---------------

N/A

Notifications impact
--------------------

None currently. The state transition information could also be sent out to
the notification system if this becomes desirable in the future.

Other end user impact
---------------------

End users should now expect more error (or try again later) responses when
performing operations concurrently on the same set of resources. Previously
some of these operations may or may not have succeeded.

Performance Impact
------------------

A new table will be created in `sqlalchemy`_ and a new model will be created
for this new schema. This table will see high read and write traffic (since all
operations that occur in cinder will write data to it) so it might be
recommended to alter the table type to a more friendly format that performs
better for this table's limited usage. Since this table is relatively simple it
should also be possible in the future (when correctness is achieved) to
switch this table to some other backend that can optimize itself for small
reads/writes with little history (history is not as useful, except for
operators and developers who wish to interrogate what has happened to a
resource in the past).

.. _sqlalchemy: http://www.sqlalchemy.org/

Other deployer impact
---------------------

N/A

Developer impact
----------------

Developers would likely get a lot of the benefit of this information to start,
since it will help them understand the states a workflow goes through (at
the cinder level). Combining this with the event stream that taskflow emits
creates a lot of useful runtime information that can be used while running
or developing cinder (where to add new state transitions becomes more obvious
when the state transitions that occur are well defined and understood).

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* Harlowja

Other contributors:

* DuncanT
* Others?
* You the person reading this?

Work Items
----------

* Determine the state diagram and debate what states should be used internally
  to cinder (the **critical** must-have states) and what states are more
  **informational** (DuncanT has apparently done some of this analysis).
* Create database schema migration/addition for the decided upon new schema.
* Create database models for the new schema (and determine and discuss how the
  atomic state update will be accomplished).
* Identify key locations where these state transitions will occur (before or
  after which taskflow tasks, or at a layer outside of taskflow).
* Add new tests that trigger these new state transitions and violation checks,
  ensuring that what is desired to occur actually occurs.
* Simultaneously work on creating a model inside of taskflow that can help
  other projects avoid recreating chunks of the above code for their own
  similar needs/use-cases.
* Test like *crazy*.

* Do load-testing/concurrency-testing (using rally or tempest) to verify the
  improvement has helped and not hurt cinder.

Milestones
----------

J/3 into K (this is likely not a short-term specification).

Dependencies
============

N/A

Testing
=======

Since this change affects how cinder operates at a low level, it will require
a good amount of testing to verify that concurrent operations are disallowed.
Currently tempest may not be the best way to test these concurrent operations
since to my knowledge it does not run in parallel (and only when it runs in
a controlled parallel process can you find these concurrency issues). So the
way to test these concurrency issues needs to be determined (is `rally`_ the
way to go here, using its concurrent scenarios to probe that this
feature works?).

.. _rally: https://wiki.openstack.org/wiki/Rally

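At the unit level, the property to check is that when several requests race
for the same transition, exactly one of them wins. The sketch below shows one
possible shape for such a check using plain threads; ``AtomicState`` is a
hypothetical in-memory stand-in for the real store, and this is not a rally
or tempest scenario.

```python
import threading


class AtomicState:
    """In-memory stand-in for a store with an atomic compare-and-swap."""

    def __init__(self):
        self._state = None
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._state != expected:
                return False
            self._state = new
            return True


def race(workers=8):
    """Have several threads request None -> DETACHING at once; count winners."""
    state = AtomicState()
    barrier = threading.Barrier(workers)
    results = []

    def attempt():
        barrier.wait()  # line all threads up so the attempts overlap
        results.append(state.compare_and_swap(None, "DETACHING"))

    threads = [threading.Thread(target=attempt) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results.count(True)
```

A test would then assert that ``race()`` returns ``1`` regardless of the
number of workers; the same pattern could be pointed at the real state store
once it exists.
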
Documentation Impact
====================

There may be new documentation required to explain why operations that were
previously allowed to occur concurrently are no longer allowed to, since
this new state transition mechanism will be stricter about what can and
cannot occur at the same time.

It will also become possible to start to form documents like the taskflow
`states`_ document that show exactly what the internals of cinder are doing
and what the allowed state transitions (aka the cinder reference operation
states) are.

References
==========

**Summit discussion/session:**

https://etherpad.openstack.org/p/juno-cinder-state-and-workflow-management

.. _states: http://docs.openstack.org/developer/taskflow/states.html
.. _zookeeper: http://zookeeper.apache.org/
.. _notifications: http://docs.openstack.org/developer/taskflow/notifications.html