Merge "Remove Cinder API races"
This commit is contained in:
commit
8f0309c9f5
314
specs/mitaka/cinder-api-atomic-status-change.rst
Normal file
314
specs/mitaka/cinder-api-atomic-status-change.rst
Normal file
@ -0,0 +1,314 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=============================================================
|
||||
Cinder Volume Active/Active support - API Races
|
||||
=============================================================
|
||||
|
||||
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
|
||||
|
||||
Right now cinder-volume service can run only in Active/Passive HA fashion.
|
||||
|
||||
One of the reasons for this is non-atomic state transitions in c-api that may
|
||||
cause race conditions.
|
||||
|
||||
This spec proposes a change to do atomic DB changes in Cinder API using
|
||||
compare-and-swap to prevent them.
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Current Cinder API Service code has sections where contents from the DB are
|
||||
retrieved at one point and then checked for validity and in case all checks
|
||||
pass and the operation can proceed DB is changed accordingly.
|
||||
|
||||
Elapsed time between the retrieval and the DB changes depends on the nature of
|
||||
the validation, in some cases it's just a simple check of the 'status' field,
|
||||
while in others it requires checking other tables for example to confirm that a
|
||||
volume is not attached or has no snapshots.
|
||||
|
||||
This way of handling checks and changes in the API creates a window of
|
||||
opportunity for code races to occur. These races would happen when changes are
|
||||
made to the DB after we have retrieved its contents, leading us to make a
|
||||
decision based on outdated data and not only change the DB using obsolete data
|
||||
as reference, but in most cases also incorrectly calling Volume Manager's
|
||||
Service to perform an operation.
|
||||
|
||||
One example of these API races is the extend method in cinder/volume/api.py
|
||||
where we check at one point the status:
|
||||
|
||||
if volume['status'] != 'available':
|
||||
|
||||
Then later on we change the status:
|
||||
|
||||
self.update(context, volume, {'status': 'extending'})
|
||||
|
||||
And we finally make an RPC request to perform the operation:
|
||||
|
||||
self.volume_rpcapi.extend_volume(context, volume, new_size, reservations)
|
||||
|
||||
|
||||
Use Cases
|
||||
=========
|
||||
|
||||
Operator would want to avoid unexpected behavior in their clouds, moreover
|
||||
since these race conditions could lead to data corruption in the storage
|
||||
back-end. And these are probably situations more likely to happen in an
|
||||
Active/Active configuration that can handle a higher volume workload.
|
||||
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Proposed solution is to use compare-and-swap updates, with retries on
|
||||
Deadlocks, to ensure that we only update the DB when all required conditions
|
||||
are met.
|
||||
|
||||
Basically a compare-and-swap is just a complex DB query that has a condition
|
||||
added to the update query to ensure that modifications to the DB only occurs if
|
||||
those conditions are met. They can be simple conditions like status ==
|
||||
'available' or more complex conditions referring to other tables.
|
||||
|
||||
This solution has been chosen over other alternatives because `performance
|
||||
numbers`_ in tests show that even under the most extreme conditions in a
|
||||
multi-master cluster DB configuration they will have great performance results.
|
||||
|
||||
The idea is to keep Cinder API service behaving as close as possible to the
|
||||
original code, so we will fail and succeed under the same conditions. Even
|
||||
though reported errors will be slightly different, the only relevant difference
|
||||
will be when we previously would have had a race condition, and possibly data
|
||||
corruption, that both API caller would have received an affirmative response to
|
||||
their request and now one of them will receive a failure as if both operations
|
||||
had been performed sequentially.
|
||||
|
||||
The small difference in how we report errors comes from the fact that on
|
||||
failure we no longer know which of the conditions triggered the failure. So in
|
||||
name of efficiency and code clarity we will be returning a generic error
|
||||
stating that some of the required conditions were not met and therefore
|
||||
requested operation could not be completed.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
An alternative to returning a generic error on failure, that was rejected for
|
||||
its complexity, would be to return the exact same errors we are returning now.
|
||||
For that, we would get updated data from the DB and give it to the validation
|
||||
method that will check which of the conditions is not met and raise the right
|
||||
exception with the right error message. For example this method could be
|
||||
checking the ``status`` of a volume and raise a message if it is not
|
||||
``available``, and then check that the volume has not snapshots and if it does
|
||||
then raise a different exception or the same exception with a different
|
||||
message.
|
||||
|
||||
Skeleton of the procedure:
|
||||
result = resource.conditional_update(values, conditions)
|
||||
if not result:
|
||||
resource = db.get_resource(resource.id)
|
||||
raise_right_validation_error(resource)
|
||||
rpc_call(resource)
|
||||
|
||||
But there is a potential race here, we could be requesting a conditional update
|
||||
that fails (because the conditions in the DB are not met), and when we retrieve
|
||||
the data from the DB it has just changed (now it would meet the conditions for
|
||||
the update), so when we call the method in charge of raising the right
|
||||
validation error with this changed data it will not find any reason why we
|
||||
cannot perform the operation, so it will not raise any exception and will just
|
||||
return control to the caller.
|
||||
|
||||
In that case, since we didn't raise an error, we would continue with the rpc
|
||||
call without having changed the DB, which is a problem. That is why we need a
|
||||
loop in the conditional update, to make sure we either succeed on the update or
|
||||
we raise the error regardless of racing conditions on the data.
|
||||
|
||||
Updated skeleton of the procedure:
|
||||
while not resource.conditional_update(values, conditions):
|
||||
resource = db.get_resource(resource.id)
|
||||
raise_right_validation_error(resource)
|
||||
rpc_call(resource)
|
||||
|
||||
In this implementation it would be a good idea to have a maximum number of
|
||||
retries instead of just looping until we either update the DB or can properly
|
||||
raise an error. That way we would not have an infinite loop if we have an
|
||||
error in our conditional update or our validation code.
|
||||
|
||||
There are also multiple alternatives to compare-and-swap that have been
|
||||
explored and discarded for being less efficient - as in being slower or
|
||||
requiring more queries to the DB.
|
||||
|
||||
Discarded alternatives are:
|
||||
|
||||
- SELECT ... FOR UPDATE, with retries on Deadlocks, which even though it works
|
||||
with multi-master configurations it has more Deadlock retries than proposed
|
||||
solution.
|
||||
|
||||
- Using a Distributed Locking Manager with different backends to enforce
|
||||
exclusive access to resources.
|
||||
|
||||
More information on the tests can be found in `performance numbers`_.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Performance impact will be negligible, and may even have better performance
|
||||
since there will be only one query to the DB instead of multiple queries. For
|
||||
example on volume deletion, where we now have one query to see the status and
|
||||
another to get the count of snapshots for that volume. With the new
|
||||
compare-and-swap we will only have 1 query that will update the DB if the
|
||||
status is correct and no snapshot exist.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Once changes are made to Cinder API methods, all new API methods that are added
|
||||
will need to conform to this new compare-and-swap way of updating the DB to
|
||||
prevent new race conditions from entering the code base.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Gorka Eguileor (geguileo)
|
||||
|
||||
Other contributors:
|
||||
Anyone is welcome to help
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Atomic update method for DB.
|
||||
|
||||
This method will require certain capabilities:
|
||||
|
||||
- Basic comparing:
|
||||
status = 'available'
|
||||
status != 'available'
|
||||
|
||||
- Comparing of multiple values:
|
||||
status in ['available', 'error']
|
||||
status not in ['attaching', 'detaching']
|
||||
|
||||
- Handling None values like Python does: When doing a negative comparison
|
||||
against a non None value it should return None values as well. So checking
|
||||
for 'migration_status' != 'migrating' would also return values that have
|
||||
'migration_status' set to None.
|
||||
|
||||
- Using additional complex filters: In some cases, like checking for
|
||||
attachments, more complex queries may be needed.
|
||||
|
||||
- Set update values depending on other fields. For example when detaching a
|
||||
volume you will have to set the status to 'available' if there are no more
|
||||
volume attachments for that volume, and set it to 'in-use' if there are more
|
||||
attachments (it was multi attached).
|
||||
|
||||
- Set update values based on fields on the DB:
|
||||
previous_status = status, status = 'retyping'
|
||||
size = size + 10
|
||||
|
||||
|
||||
* Atomic update method in Versioned Objects.
|
||||
|
||||
Besides exposing the functionality provided by above DB conditional update item
|
||||
it also needs to:
|
||||
|
||||
- Automatically add the ID of the resource to the conditional update.
|
||||
|
||||
- If no conditions are given it must assume we want the DB record to be
|
||||
unaltered.
|
||||
|
||||
- It must allow saving dirty attributes together with the update.
|
||||
|
||||
- It must be able to update the versioned object with data that has been
|
||||
written to the DB, even on conditional values.
|
||||
|
||||
|
||||
* Update Cinder API Service methods to use compare-and-swap.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
No openstack dependencies, but it has a SQLAlchemy depencency on version 1.0.10
|
||||
that includes Parameter-Ordered Updates (`issue #3541`_).
|
||||
|
||||
On some DBs' update method is order dependent, so they behave differently
|
||||
depending on the order of the values, example on a volume with 'available'
|
||||
status:
|
||||
|
||||
UPDATE volumes SET previous_status=status, status='retyping' WHERE
|
||||
id='44f284f9-877d-4fce-9eb4-67a052410054';
|
||||
|
||||
Will result in a volume with 'retyping' status and 'available' previous_status
|
||||
both on SQLite and MariaDB, but
|
||||
|
||||
UPDATE volumes SET status='retyping', previous_status=status WHERE
|
||||
id='44f284f9-877d-4fce-9eb4-67a052410054';
|
||||
|
||||
Will yield the same result in SQLite but will result in a volume with status
|
||||
and previous_status set to 'retyping' in MariaDB, which is not what we want, so
|
||||
order must be taken into consideration.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Unit tests will be added for Atomic Update methods.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. _performance numbers: http://gorka.eguileor.com/cinders-api-races/
|
||||
.. _issue #3541: https://bitbucket.org/zzzeek/sqlalchemy/issues/3541/
|
||||
|
||||
* https://etherpad.openstack.org/p/cinder-active-active-vol-service-issues
|
||||
* https://review.openstack.org/#/c/205834
|
||||
* https://review.openstack.org/#/c/216377
|
||||
* https://review.openstack.org/#/c/205835
|
||||
* https://review.openstack.org/#/c/216378
|
||||
* https://review.openstack.org/#/c/221441
|
||||
* https://review.openstack.org/#/c/221442
|
Loading…
x
Reference in New Issue
Block a user