From 900f74440172c40edadb07104545d314e17e9e0d Mon Sep 17 00:00:00 2001
From: Gorka Eguileor <geguileo@redhat.com>
Date: Wed, 29 Jul 2015 17:38:27 +0200
Subject: [PATCH] Remove Cinder API races

This spec outlines issues needed to achieve Cinder A/A HA and proposes
changes to c-api to remove race conditions when writing to the same
resource simultaneously in 2 operations.

Implements: blueprint cinder-volume-active-active-support
Change-Id: If6142d7ccddba25d4fdef4fe1e689358140fa2d9
---
 .../cinder-api-atomic-status-change.rst       | 314 ++++++++++++++++++
 1 file changed, 314 insertions(+)
 create mode 100644 specs/mitaka/cinder-api-atomic-status-change.rst

diff --git a/specs/mitaka/cinder-api-atomic-status-change.rst b/specs/mitaka/cinder-api-atomic-status-change.rst
new file mode 100644
index 00000000..0b36b868
--- /dev/null
+++ b/specs/mitaka/cinder-api-atomic-status-change.rst
@@ -0,0 +1,314 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=============================================================
+Cinder Volume Active/Active support - API Races
+=============================================================
+
+https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
+
+Right now cinder-volume service can run only in Active/Passive HA fashion.
+
+One of the reasons for this is non-atomic state transitions in c-api that may
+cause race conditions.
+
+This spec proposes a change to do atomic DB changes in Cinder API using
+compare-and-swap to prevent them.
+
+
+Problem description
+===================
+
+Current Cinder API Service code has sections where contents from the DB are
+retrieved at one point and then checked for validity and in case all checks
+pass and the operation can proceed DB is changed accordingly.
+
+Elapsed time between the retrieval and the DB changes depends on the nature of
+the validation, in some cases it's just a simple check of the 'status' field,
+while in others it requires checking other tables for example to confirm that a
+volume is not attached or has no snapshots.
+
+This way of handling checks and changes in the API creates a window of
+opportunity for code races to occur. These races would happen when changes are
+made to the DB after we have retrieved its contents, leading us to make a
+decision based on outdated data and not only change the DB using obsolete data
+as reference, but in most cases also incorrectly calling Volume Manager's
+Service to perform an operation.
+
+One example of these API races is the extend method in cinder/volume/api.py
+where we check at one point the status:
+
+ if volume['status'] != 'available':
+
+Then later on we change the status:
+
+ self.update(context, volume, {'status': 'extending'})
+
+And we finally make an RPC request to perform the operation:
+
+ self.volume_rpcapi.extend_volume(context, volume, new_size, reservations)
+
+
+Use Cases
+=========
+
+Operator would want to avoid unexpected behavior in their clouds, moreover
+since these race conditions could lead to data corruption in the storage
+back-end.  And these are probably situations more likely to happen in an
+Active/Active configuration that can handle a higher volume workload.
+
+
+Proposed change
+===============
+
+Proposed solution is to use compare-and-swap updates, with retries on
+Deadlocks, to ensure that we only update the DB when all required conditions
+are met.
+
+Basically a compare-and-swap is just a complex DB query that has a condition
+added to the update query to ensure that modifications to the DB only occurs if
+those conditions are met.  They can be simple conditions like status ==
+'available' or more complex conditions referring to other tables.
+
+This solution has been chosen over other alternatives because `performance
+numbers`_ in tests show that even under the most extreme conditions in a
+multi-master cluster DB configuration they will have great performance results.
+
+The idea is to keep Cinder API service behaving as close as possible to the
+original code, so we will fail and succeed under the same conditions.  Even
+though reported errors will be slightly different, the only relevant difference
+will be when we previously would have had a race condition, and possibly data
+corruption, that both API caller would have received an affirmative response to
+their request and now one of them will receive a failure as if both operations
+had been performed sequentially.
+
+The small difference in how we report errors comes from the fact that on
+failure we no longer know which of the conditions triggered the failure.  So in
+name of efficiency and code clarity we will be returning a generic error
+stating that some of the required conditions were not met and therefore
+requested operation could not be completed.
+
+Alternatives
+------------
+
+An alternative to returning a generic error on failure, that was rejected for
+its complexity, would be to return the exact same errors we are returning now.
+For that, we would get updated data from the DB and give it to the validation
+method that will check which of the conditions is not met and raise the right
+exception with the right error message.  For example this method could be
+checking the ``status`` of a volume and raise a message if it is not
+``available``, and then check that the volume has not snapshots and if it does
+then raise a different exception or the same exception with a different
+message.
+
+Skeleton of the procedure:
+ result = resource.conditional_update(values, conditions)
+ if not result:
+     resource = db.get_resource(resource.id)
+     raise_right_validation_error(resource)
+ rpc_call(resource)
+
+But there is a potential race here, we could be requesting a conditional update
+that fails (because the conditions in the DB are not met), and when we retrieve
+the data from the DB it has just changed (now it would meet the conditions for
+the update), so when we call the method in charge of raising the right
+validation error with this changed data it will not find any reason why we
+cannot perform the operation, so it will not raise any exception and will just
+return control to the caller.
+
+In that case, since we didn't raise an error, we would continue with the rpc
+call without having changed the DB, which is a problem.  That is why we need a
+loop in the conditional update, to make sure we either succeed on the update or
+we raise the error regardless of racing conditions on the data.
+
+Updated skeleton of the procedure:
+ while not resource.conditional_update(values, conditions):
+     resource = db.get_resource(resource.id)
+     raise_right_validation_error(resource)
+ rpc_call(resource)
+
+In this implementation it would be a good idea to have a maximum number of
+retries instead of just looping until we either update the DB or can properly
+raise an error.  That way we would not have an infinite loop if we have an
+error in our conditional update or our validation code.
+
+There are also multiple alternatives to compare-and-swap that have been
+explored and discarded for being less efficient - as in being slower or
+requiring more queries to the DB.
+
+Discarded alternatives are:
+
+- SELECT ...  FOR UPDATE, with retries on Deadlocks, which even though it works
+  with multi-master configurations it has more Deadlock retries than proposed
+  solution.
+
+- Using a Distributed Locking Manager with different backends to enforce
+  exclusive access to resources.
+
+More information on the tests can be found in `performance numbers`_.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Performance impact will be negligible, and may even have better performance
+since there will be only one query to the DB instead of multiple queries.  For
+example on volume deletion, where we now have one query to see the status and
+another to get the count of snapshots for that volume.  With the new
+compare-and-swap we will only have 1 query that will update the DB if the
+status is correct and no snapshot exist.
+
+Other deployer impact
+---------------------
+
+None
+
+
+Developer impact
+----------------
+
+Once changes are made to Cinder API methods, all new API methods that are added
+will need to conform to this new compare-and-swap way of updating the DB to
+prevent new race conditions from entering the code base.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Gorka Eguileor (geguileo)
+
+Other contributors:
+  Anyone is welcome to help
+
+Work Items
+----------
+
+* Atomic update method for DB.
+
+This method will require certain capabilities:
+
+- Basic comparing:
+  status = 'available'
+  status != 'available'
+
+- Comparing of multiple values:
+  status in ['available', 'error']
+  status not in ['attaching', 'detaching']
+
+- Handling None values like Python does: When doing a negative comparison
+  against a non None value it should return None values as well. So checking
+  for 'migration_status' != 'migrating' would also return values that have
+  'migration_status' set to None.
+
+- Using additional complex filters: In some cases, like checking for
+  attachments, more complex queries may be needed.
+
+- Set update values depending on other fields.  For example when detaching a
+  volume you will have to set the status to 'available' if there are no more
+  volume attachments for that volume, and set it to 'in-use' if there are more
+  attachments (it was multi attached).
+
+- Set update values based on fields on the DB:
+  previous_status = status, status = 'retyping'
+  size = size + 10
+
+
+* Atomic update method in Versioned Objects.
+
+Besides exposing the functionality provided by above DB conditional update item
+it also needs to:
+
+- Automatically add the ID of the resource to the conditional update.
+
+- If no conditions are given it must assume we want the DB record to be
+  unaltered.
+
+- It must allow saving dirty attributes together with the update.
+
+- It must be able to update the versioned object with data that has been
+  written to the DB, even on conditional values.
+
+
+* Update Cinder API Service methods to use compare-and-swap.
+
+
+Dependencies
+============
+
+No openstack dependencies, but it has a SQLAlchemy depencency on version 1.0.10
+that includes Parameter-Ordered Updates (`issue #3541`_).
+
+On some DBs' update method is order dependent, so they behave differently
+depending on the order of the values, example on a volume with 'available'
+status:
+
+   UPDATE volumes SET previous_status=status, status='retyping' WHERE
+   id='44f284f9-877d-4fce-9eb4-67a052410054';
+
+Will result in a volume with 'retyping' status and 'available' previous_status
+both on SQLite and MariaDB, but
+
+   UPDATE volumes SET status='retyping', previous_status=status WHERE
+   id='44f284f9-877d-4fce-9eb4-67a052410054';
+
+Will yield the same result in SQLite but will result in a volume with status
+and previous_status set to 'retyping' in MariaDB, which is not what we want, so
+order must be taken into consideration.
+
+
+Testing
+=======
+
+Unit tests will be added for Atomic Update methods.
+
+
+Documentation Impact
+====================
+
+None
+
+
+References
+==========
+
+.. _performance numbers: http://gorka.eguileor.com/cinders-api-races/
+.. _issue #3541: https://bitbucket.org/zzzeek/sqlalchemy/issues/3541/
+
+* https://etherpad.openstack.org/p/cinder-active-active-vol-service-issues
+* https://review.openstack.org/#/c/205834
+* https://review.openstack.org/#/c/216377
+* https://review.openstack.org/#/c/205835
+* https://review.openstack.org/#/c/216378
+* https://review.openstack.org/#/c/221441
+* https://review.openstack.org/#/c/221442