91392d5432
Several small RST format fixes, which cause parsing warnings and errors, and render oddly on specs.openstack.org. Discovered while working on a spec parsing tool to replace the retired releasestatus tool. Change-Id: Ice06999e51f1dd997d370273aae750b3c5ad2ee2
317 lines
11 KiB
ReStructuredText
317 lines
11 KiB
ReStructuredText
..
|
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
|
License.
|
|
|
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
|
|
|
=============================================================
|
|
Cinder Volume Active/Active support - API Races
|
|
=============================================================
|
|
|
|
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
|
|
|
|
Right now cinder-volume service can run only in Active/Passive HA fashion.
|
|
|
|
One of the reasons for this is non-atomic state transitions in c-api that may
|
|
cause race conditions.
|
|
|
|
This spec proposes a change to do atomic DB changes in Cinder API using
|
|
compare-and-swap to prevent them.
|
|
|
|
|
|
Problem description
|
|
===================
|
|
|
|
Current Cinder API Service code has sections where contents from the DB are
|
|
retrieved at one point and then checked for validity and in case all checks
|
|
pass and the operation can proceed DB is changed accordingly.
|
|
|
|
Elapsed time between the retrieval and the DB changes depends on the nature of
|
|
the validation, in some cases it's just a simple check of the 'status' field,
|
|
while in others it requires checking other tables for example to confirm that a
|
|
volume is not attached or has no snapshots.
|
|
|
|
This way of handling checks and changes in the API creates a window of
|
|
opportunity for code races to occur. These races would happen when changes are
|
|
made to the DB after we have retrieved its contents, leading us to make a
|
|
decision based on outdated data and not only change the DB using obsolete data
|
|
as reference, but in most cases also incorrectly calling Volume Manager's
|
|
Service to perform an operation.
|
|
|
|
One example of these API races is the extend method in cinder/volume/api.py
|
|
where we check at one point the status:
|
|
|
|
if volume['status'] != 'available':
|
|
|
|
Then later on we change the status:
|
|
|
|
self.update(context, volume, {'status': 'extending'})
|
|
|
|
And we finally make an RPC request to perform the operation:
|
|
|
|
self.volume_rpcapi.extend_volume(context, volume, new_size, reservations)
|
|
|
|
|
|
Use Cases
|
|
=========
|
|
|
|
Operator would want to avoid unexpected behavior in their clouds, moreover
|
|
since these race conditions could lead to data corruption in the storage
|
|
back-end. And these are probably situations more likely to happen in an
|
|
Active/Active configuration that can handle a higher volume workload.
|
|
|
|
|
|
Proposed change
|
|
===============
|
|
|
|
Proposed solution is to use compare-and-swap updates, with retries on
|
|
Deadlocks, to ensure that we only update the DB when all required conditions
|
|
are met.
|
|
|
|
Basically a compare-and-swap is just a complex DB query that has a condition
|
|
added to the update query to ensure that modifications to the DB only occurs if
|
|
those conditions are met. They can be simple conditions like status ==
|
|
'available' or more complex conditions referring to other tables.
|
|
|
|
This solution has been chosen over other alternatives because `performance
|
|
numbers`_ in tests show that even under the most extreme conditions in a
|
|
multi-master cluster DB configuration they will have great performance results.
|
|
|
|
The idea is to keep Cinder API service behaving as close as possible to the
|
|
original code, so we will fail and succeed under the same conditions. Even
|
|
though reported errors will be slightly different, the only relevant difference
|
|
will be when we previously would have had a race condition, and possibly data
|
|
corruption, that both API caller would have received an affirmative response to
|
|
their request and now one of them will receive a failure as if both operations
|
|
had been performed sequentially.
|
|
|
|
The small difference in how we report errors comes from the fact that on
|
|
failure we no longer know which of the conditions triggered the failure. So in
|
|
name of efficiency and code clarity we will be returning a generic error
|
|
stating that some of the required conditions were not met and therefore
|
|
requested operation could not be completed.
|
|
|
|
Alternatives
|
|
------------
|
|
|
|
An alternative to returning a generic error on failure, that was rejected for
|
|
its complexity, would be to return the exact same errors we are returning now.
|
|
For that, we would get updated data from the DB and give it to the validation
|
|
method that will check which of the conditions is not met and raise the right
|
|
exception with the right error message. For example this method could be
|
|
checking the ``status`` of a volume and raise a message if it is not
|
|
``available``, and then check that the volume has not snapshots and if it does
|
|
then raise a different exception or the same exception with a different
|
|
message.
|
|
|
|
Skeleton of the procedure::
|
|
|
|
result = resource.conditional_update(values, conditions)
|
|
if not result:
|
|
resource = db.get_resource(resource.id)
|
|
raise_right_validation_error(resource)
|
|
rpc_call(resource)
|
|
|
|
But there is a potential race here, we could be requesting a conditional update
|
|
that fails (because the conditions in the DB are not met), and when we retrieve
|
|
the data from the DB it has just changed (now it would meet the conditions for
|
|
the update), so when we call the method in charge of raising the right
|
|
validation error with this changed data it will not find any reason why we
|
|
cannot perform the operation, so it will not raise any exception and will just
|
|
return control to the caller.
|
|
|
|
In that case, since we didn't raise an error, we would continue with the rpc
|
|
call without having changed the DB, which is a problem. That is why we need a
|
|
loop in the conditional update, to make sure we either succeed on the update or
|
|
we raise the error regardless of racing conditions on the data.
|
|
|
|
Updated skeleton of the procedure::
|
|
|
|
while not resource.conditional_update(values, conditions):
|
|
resource = db.get_resource(resource.id)
|
|
raise_right_validation_error(resource)
|
|
rpc_call(resource)
|
|
|
|
In this implementation it would be a good idea to have a maximum number of
|
|
retries instead of just looping until we either update the DB or can properly
|
|
raise an error. That way we would not have an infinite loop if we have an
|
|
error in our conditional update or our validation code.
|
|
|
|
There are also multiple alternatives to compare-and-swap that have been
|
|
explored and discarded for being less efficient - as in being slower or
|
|
requiring more queries to the DB.
|
|
|
|
Discarded alternatives are:
|
|
|
|
- SELECT ... FOR UPDATE, with retries on Deadlocks, which even though it works
|
|
with multi-master configurations it has more Deadlock retries than proposed
|
|
solution.
|
|
|
|
- Using a Distributed Locking Manager with different backends to enforce
|
|
exclusive access to resources.
|
|
|
|
More information on the tests can be found in `performance numbers`_.
|
|
|
|
Data model impact
|
|
-----------------
|
|
|
|
None
|
|
|
|
REST API impact
|
|
---------------
|
|
|
|
None
|
|
|
|
Security impact
|
|
---------------
|
|
|
|
None
|
|
|
|
Notifications impact
|
|
--------------------
|
|
|
|
None
|
|
|
|
Other end user impact
|
|
---------------------
|
|
|
|
None
|
|
|
|
Performance Impact
|
|
------------------
|
|
|
|
Performance impact will be negligible, and may even have better performance
|
|
since there will be only one query to the DB instead of multiple queries. For
|
|
example on volume deletion, where we now have one query to see the status and
|
|
another to get the count of snapshots for that volume. With the new
|
|
compare-and-swap we will only have 1 query that will update the DB if the
|
|
status is correct and no snapshot exist.
|
|
|
|
Other deployer impact
|
|
---------------------
|
|
|
|
None
|
|
|
|
|
|
Developer impact
|
|
----------------
|
|
|
|
Once changes are made to Cinder API methods, all new API methods that are added
|
|
will need to conform to this new compare-and-swap way of updating the DB to
|
|
prevent new race conditions from entering the code base.
|
|
|
|
|
|
Implementation
|
|
==============
|
|
|
|
Assignee(s)
|
|
-----------
|
|
|
|
Primary assignee:
|
|
Gorka Eguileor (geguileo)
|
|
|
|
Other contributors:
|
|
Anyone is welcome to help
|
|
|
|
Work Items
|
|
----------
|
|
|
|
* Atomic update method for DB.
|
|
|
|
This method will require certain capabilities:
|
|
|
|
- Basic comparing:
|
|
status = 'available'
|
|
status != 'available'
|
|
|
|
- Comparing of multiple values:
|
|
status in ['available', 'error']
|
|
status not in ['attaching', 'detaching']
|
|
|
|
- Handling None values like Python does: When doing a negative comparison
|
|
against a non None value it should return None values as well. So checking
|
|
for 'migration_status' != 'migrating' would also return values that have
|
|
'migration_status' set to None.
|
|
|
|
- Using additional complex filters: In some cases, like checking for
|
|
attachments, more complex queries may be needed.
|
|
|
|
- Set update values depending on other fields. For example when detaching a
|
|
volume you will have to set the status to 'available' if there are no more
|
|
volume attachments for that volume, and set it to 'in-use' if there are more
|
|
attachments (it was multi attached).
|
|
|
|
- Set update values based on fields on the DB:
|
|
previous_status = status, status = 'retyping'
|
|
size = size + 10
|
|
|
|
|
|
* Atomic update method in Versioned Objects.
|
|
|
|
Besides exposing the functionality provided by above DB conditional update item
|
|
it also needs to:
|
|
|
|
- Automatically add the ID of the resource to the conditional update.
|
|
|
|
- If no conditions are given it must assume we want the DB record to be
|
|
unaltered.
|
|
|
|
- It must allow saving dirty attributes together with the update.
|
|
|
|
- It must be able to update the versioned object with data that has been
|
|
written to the DB, even on conditional values.
|
|
|
|
|
|
* Update Cinder API Service methods to use compare-and-swap.
|
|
|
|
|
|
Dependencies
|
|
============
|
|
|
|
No openstack dependencies, but it has a SQLAlchemy depencency on version 1.0.10
|
|
that includes Parameter-Ordered Updates (`issue #3541`_).
|
|
|
|
On some DBs' update method is order dependent, so they behave differently
|
|
depending on the order of the values, example on a volume with 'available'
|
|
status:
|
|
|
|
UPDATE volumes SET previous_status=status, status='retyping' WHERE
|
|
id='44f284f9-877d-4fce-9eb4-67a052410054';
|
|
|
|
Will result in a volume with 'retyping' status and 'available' previous_status
|
|
both on SQLite and MariaDB, but
|
|
|
|
UPDATE volumes SET status='retyping', previous_status=status WHERE
|
|
id='44f284f9-877d-4fce-9eb4-67a052410054';
|
|
|
|
Will yield the same result in SQLite but will result in a volume with status
|
|
and previous_status set to 'retyping' in MariaDB, which is not what we want, so
|
|
order must be taken into consideration.
|
|
|
|
|
|
Testing
|
|
=======
|
|
|
|
Unit tests will be added for Atomic Update methods.
|
|
|
|
|
|
Documentation Impact
|
|
====================
|
|
|
|
None
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
.. _performance numbers: http://gorka.eguileor.com/cinders-api-races/
|
|
.. _issue #3541: https://bitbucket.org/zzzeek/sqlalchemy/issues/3541/
|
|
|
|
* https://etherpad.openstack.org/p/cinder-active-active-vol-service-issues
|
|
* https://review.openstack.org/#/c/205834
|
|
* https://review.openstack.org/#/c/216377
|
|
* https://review.openstack.org/#/c/205835
|
|
* https://review.openstack.org/#/c/216378
|
|
* https://review.openstack.org/#/c/221441
|
|
* https://review.openstack.org/#/c/221442
|