Merge "Job Distribution to support HA A/A"

2016-05-03 19:05:25 +00:00 · 2016-05-03 19:05:25 +00:00 · 73afc6b7a8
commit 73afc6b7a8
parent 8a42d06b71 cfd1d7377e
1 changed files with 434 additions and 0 deletions
--- a/specs/newton/ha-aa-job-distribution.rst
+++ b/specs/newton/ha-aa-job-distribution.rst
@ -0,0 +1,434 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=============================================================
+Cinder Volume Active/Active support - Job Distribution
+=============================================================
+
+https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
+
+Right now cinder-volume service can run only in Active/Passive HA fashion.
+
+One of the reasons for this is that we have no concept of a cluster of nodes
+that handle the same storage back-end.
+
+This spec introduces a slight modification to the `host` concept in Cinder to
+include the concept of a cluster as well and provide a way for Cinder's API and
+Scheduler nodes to distribute jobs to Volume nodes on a High Availability
+deployment with Active/Active configuration.
+
+
+Problem description
+===================
+
+Right now cinder-volume service only accepts Active/Passive High Availability
+configurations, and current job distribution mechanism does not allow having
+multiple hosts grouped in a cluster where jobs can be queued to be processed by
+any of those nodes.
+
+Jobs are currently being distributed using a topic based message queue that is
+identified by the `volume_topic`, `scheduler_topic`, or `backup_topic` prefix
+joined with the host name and possibly the backend name if it's a multibackend
+node like in `cinder-volume.localhost@lvm`, and that's the mechanism used to
+send jobs to the Volume nodes regardless of the physical address of the node
+that is going to be handling the job, allowing an easier transition on
+failover.
+
+Chosen solution must be backward compatible as well as allow the new
+Active/Active configuration to effectively send jobs.
+
+In the Active/Active configuration there can be multiple Volume nodes - this is
+not mandatory at all times, as failures may leave us with only 1 active node -
+with different `host` configuration values that can interchangeably accept jobs
+that are handling the same storage backend.
+
+
+Use Cases
+=========
+
+Operators that have hard requirements, SLA or other reasons, to have their
+cloud operational at all times or have higher throughput requirements will want
+to have the possibility to configure their deployments with an Active/Active
+configuration.
+
+
+Proposed change
+===============
+
+Basic mechanism
+---------------
+
+To provide a mechanism that will allow us to distribute jobs to a group of
+nodes we'll make a slight change in the meaning of the `host` configuration
+option in Cinder.  Instead of just being "the name of this node" as it is now,
+it will be the "logical host name that groups a set of - one or many -
+homogeneous nodes under a common name. This host set will behave as one entity
+to the callers".
+
+That way the `host` configuration option will uniquely identify a group of
+Volume nodes that share the same storage backends and therefore can accept jobs
+for the same volumes interchangeably.
+
+This means that now `host` can have the same value on multiple volume nodes,
+with the only requisite that all nodes that share the same value must also
+share the same storage backends and configuration.
+
+Since now we are using `host` as a logical grouping we need some other way to
+uniquely identify each node, and that's where a new configuration option called
+`node_name` comes in.
+
+By default both `node_name` and `host` will take the value of
+`socket.gethostname()`, this will allow us to be upgrade compatible with
+deployments that are currently using `host` field to work with unsupported
+Active/Active configurations.
+
+By using the same `host` configuration option that we were previously using we
+are able to keep the messaging queue topics as they were and we ensure not only
+backward compatibility but also conformity with rolling upgrades requirements.
+
+One benefit of this solution is that it will be very easy to change a
+deployment from Single node or Active/Passive to Active/Active without any
+downtime. All that is needed is to configure the same `host` configuration
+option that the active node has and configure storage backends in the same way,
+`node_name` must be different though.
+
+This new mechanism will promote the host "service unit" to allow grouping and
+will add a new node unit that will equate to the old host unit in order to
+maintain the granularity of some operations.
+
+Heartbeats
+----------
+
+With Active/Passive configurations a storage backend service is down whenever
+we don't have a valid heartbeat from the host and is up if we do.  These
+heartbeats are reported in the DB in ``services`` table.
+
+On Active/Active configurations a service is down if there is no valid
+heartbeat from any of the nodes that constitute the host set, and it is up if
+there is at least one valid heartbeat.
+
+Since we are moving from a one to one relationship between hosts and services
+to a many to one relationship, we will need to make some changes to the DB
+services table as explained in `Data model impact`_.
+
+Each node backend will be reporting individual heartbeats and the scheduler
+will aggregate this information to know which backends are up/down grouping by
+the `host` field.  This way we'll also be able to tell which specific nodes are
+up/down based on their individual heartbeats.
+
+We'll also need to update the REST API to report detailed information of the
+status of the different nodes in the cluster as explained in `REST API
+impact`_.
+
+Disabling
+---------
+
+Even though we'll now have multiple nodes working under the same `host` we
+won't be changing disabling behavior in any way.  Disabling a service will
+still prevent schedulers from taking that service into consideration during
+filtering and weighting and the service will still be reachable for all
+operations that don't go through the scheduler.
+
+It stands to reason that sometimes we'll need to drain nodes to remove them
+from a service group, but this spec and its implementation will not be adding
+any new mechanism for that.  So existing mechanism should be used to perform
+graceful shutdown of c-vol nodes.
+
+Current graceful shutdown mechanism will make sure that no new operations are
+received from the AMQP queue while it waits for ongoing operations to complete
+before stopping.
+
+It is important to remember that graceful shutdown has a timeout that will
+forcefully stop operations if they take longer than the configured value.
+Configuration option is called `graceful_shutdown_timeout`, goes in [DEFAULT]
+section and takes a default value of 60 seconds; so this should be configured
+in our deployments if we think this is not long enough for our use cases.
+
+Capabilities
+------------
+
+All Volume Nodes periodically report their capabilities to the schedulers to
+keep them updated with their stats, that way they can make informed decisions
+on where to perform operations.
+
+In a similar way to the Service state reporting we need to prevent concurrent
+access to the data structure when updating this information. Fortunately for us
+we are storing this information in a Python dictionary on the schedulers, and
+since we are using an eventlet executor for the RPC server we don’t have to
+worry about using locks, the inherent behavior of the executor will prevent
+concurrent access to the dictionary.  So no changes are needed there to have
+exclusive access to the data structure.
+
+Although rare, we could have a consistency problem among nodes where different
+schedulers would not have the same information for a given backend.
+
+When we had only 1 node reporting for each given backend this was not a
+situation that could happen, since received capabilities report was always the
+latest and all scheduler nodes were in sync.  But now that we have multiple
+nodes reporting on the same backend we could receive two reports from different
+Volume nodes on the same backend and they could be processed in different order
+on different nodes, thus making as have different data on each
+scheduler.
+
+The reason why we can't assure that all schedulers will have the same
+capabilities stored in their internal structures is because capabilities
+reports can be processed in a different order on different nodes.  Order is
+preserved in *almost all* stages, nodes report in a specific order and message
+broker preserves this order and they are even delivered in the same order, but
+when each node processes them we can have greenthreads execution in different
+order on different nodes thus ending up with different data on each node.
+
+This case could probably be ignored since it's very rare and differences would
+be small, but in the interest of consistent of the backend capabilities on
+Scheduler nodes, we will timestamp the capabilities on the Volume nodes before
+they are sent to the scheduler, instead of doing it on the scheduler as we are
+doing now. And then we'll have schedulers drop any capabilities that is older
+than the one in the data structure.
+
+By making this change we facilitate new features related to capability
+reporting, like capability caching.  Since capability gathering is usually an
+expensive operation and in Active-Active configurations we'll have multiple
+nodes requesting the same capabilities with the same frequency for the same
+back-end, so capability caching could be a good solution to decrease the cost
+of the gathering on the backend.
+
+Alternatives
+------------
+
+One alternative to proposed job distribution would be to leave the topic queues
+as they are and move the job distribution logic to the scheduler.
+
+The scheduler would receive a job and then send it to one of the hosts that
+belong to the same cluster and is not down.
+
+This method has one problem, and that is that we could be sending a job to a
+node that is down but whose heartbeat hasn't expired yet, or one that has gone
+down before getting the job from the queue.  In these cases we would end up
+with a job that is not being processed by anyone and we would need to either
+wait for the node to go back up or the scheduler would need to retrieve that
+message from the queue and send it to another active node.
+
+An alternative to proposed heartbeats is that all services report using
+`cluster@backend` instead of `host@backend` like they are doing now and as long
+as we have a valid heartbeat we know that the service is up.
+
+There are 2 reasons why I believe that sending independent heartbeats is a
+superior solution, even if we need to modify the DB tables:
+
+- Higher information granularity: We can report not only which services are
+  up/down but also which nodes are up/down.
+
+- It will help us on job cleanup of failed nodes that do not come back up.
+  Although cleanup is not part of this spec, it is good to keep it in mind and
+  facilitate it as much as possible.
+
+Another alternative for the job distribution, which was the proposed solution
+in previous versions of this specification, was to add a new configuration
+option called ``cluster`` instead of ``node`` as we propose now.
+
+In that case message topics would change and instead of using ``host`` for them
+they would use this new ``cluster`` option that would default when undefined to
+the same value as ``host``.
+
+The main advantage of this alternative is that it's easier to understand the
+concepts of a cluster comprised of hosts than a conceptual host entity composed
+of interchangeable nodes.
+
+Unfortunately this lone advantage is easily outweighed by the overwhelming
+number of disadvantages that presents:
+
+- Since we cannot rename DB fields directly with rolling upgrades we have to
+  make progressive changes through 3 releases to reach desired state of
+  renaming ``host`` field to ``cluster`` field in tables ``service``,
+  ``consistencygroups``, ``volumes``, ``backups``.
+
+- Renaming of rpc methods and rpc arguments will also take some releases.
+
+- Backports will get a lot more complicated since we will have cosmetic changes
+  all over the place, in API, RPC, DB, Scheduler, etc.
+
+- This will introduce a concept that doesn't make much sense in the API and
+  Scheduler nodes, since they are always grouped as a cluster of nodes and not
+  as individual nodes.
+
+- The risk of introducing new bugs with the ``host`` to ``cluster`` names in DB
+  fields, variables, method names, and method arguments is high.
+
+- Even if it's not recommended, there are people who are doing HA Active-Active
+  using the host name, and adding the ``cluster`` field would create problems
+  for them.
+
+Data model impact
+-----------------
+
+*Final result:*
+
+We need to split current ``services`` DB table (``Service`` ORM model) into 2
+different tables, ``services`` and ``service_nodes`` to support individual
+heartbeats for each node of a logical service host.
+
+Modifications to ``services`` table will be:
+
+- Removal of ``report_count`` field
+- Removal of ``modified_at`` field since we can go back to using ``updated_at``
+  field now that heartbeats will be reported on another table.
+- Removal of all version fields ``rpc_current_version``,
+  ``object_current_version`` as they will be moved to ``service_nodes`` table.
+
+New ``service_nodes`` table will have following fields:
+
+- ``id``: Unique identifier for the service
+- ``service_id``: This will be the foreign key that will join with ``services``
+  table.
+- ``name``: Primary key. Same meaning as it holds now
+- ``report_count``: Same meaning as it holds now
+- ``rpc_version``: RPC version for the service
+- ``object_version``: Versioned Objects Version for the service
+
+*Intermediate steps:*
+
+In order to support rolling upgrades we can't just drop fields in the same
+upgrade we are moving them to another table, so we need several steps/versions
+to do so.  Here's the steps that will be taken to reach desired final result
+mentioned above:
+
+In N release:
+
+- Add ``service_nodes`` table.
+- All N c-vol nodes will report heartbeats in new ``service_nodes`` table.
+- To allow coexistence of M and N nodes (necessary for rolling upgrades) the
+  detection of down nodes will be done by checking both the ``services`` table
+  and the ``service_nodes`` table and considering a node down if both reports
+  say the service is down.
+- All N c-vol nodes will report their RPC and Objects version in the
+  ``service_nodes`` table.
+- To allow coexistence of M and N nodes RPC versions will use both tables to
+  determine minimum versions that are running.
+
+In O release:
+
+- Service availability will only be done checking the ``service_nodes`` table.
+- Minimum version detection will only be done checking the ``service_nodes``
+  table.
+
+- As a final step of the rolling upgrade we'll drop fields ``report_count``,
+  ``modified_at``, ``rpc_current_version``, and ``object_current_version`` from
+  ``services`` table as they have been moved to ``service_nodes`` table.
+
+REST API impact
+---------------
+
+To report status of services we will add an optional parameter called
+``node_detail`` which will report the node breakdown.
+
+When ``node_detail`` is not set we will report exactly as we are doing now, to
+be backward compatible with clients, so we'll be sending the service is up if
+*any* the nodes forming the host set is up and sending that it is disabled if
+the service is globally disabled or all the nodes of the service are disabled.
+
+When ``node_detail`` is set to False, which tells us client knows of the new
+API options, we'll return new fields:
+
+- ``nodes``: Number of nodes that form the host set.
+- ``down_nodes``: Number of nodes that are down in the host set.
+- ``last_heartbeat``: Last heartbeat from any node
+
+If ``node_detail`` is set to True, we'll return a field called ``nodes`` that
+will contain a list of dictionaries with ``name``, ``status``, and
+``heartbeat`` keys that will contain individual values for each of the nodes of
+the host set.
+
+Changes will be backward compatible with XML responses, but new functionality
+will not work with XML responses.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Negligible if we implement the aggregation of the heartbeats on a SQL query
+using exist instead of retrieving all heartbeats and doing the aggregation on
+the scheduler.
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Gorka Eguileor (geguileo)
+
+Other contributors:
+  Michal Dulko (dulek)
+  Scott DAngelo (scottda)
+  Anyone is welcome to help
+
+Work Items
+----------
+
+- Add the new ``service_nodes`` table.
+
+- Add `node_name` configuration option.
+
+- Modify Scheduler code to aggregate the different heartbeats.
+
+- Change c-vol heartbeat mechanism.
+
+- Change API's service index response as well as the update.
+
+- Update cinder-client to support new service listing.
+
+- Update manage client service commands.
+
+
+Dependencies
+============
+
+None
+
+
+Testing
+=======
+
+Unittests for new API behavior.
+
+
+Documentation Impact
+====================
+
+This spec has changes to the API as well as a new configuration option that
+will need to be documented.
+
+
+References
+==========
+
+None