Merge "Job Distribution to support HA A/A"
This commit is contained in:
commit
73afc6b7a8
434
specs/newton/ha-aa-job-distribution.rst
Normal file
434
specs/newton/ha-aa-job-distribution.rst
Normal file
@ -0,0 +1,434 @@
|
||||
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Job Distribution
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can run only in an Active/Passive HA
fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end.

This spec introduces a slight modification to the `host` concept in Cinder to
include the concept of a cluster as well, and provides a way for Cinder's API
and Scheduler nodes to distribute jobs to Volume nodes on a High Availability
deployment with an Active/Active configuration.


Problem description
===================

Right now the cinder-volume service only accepts Active/Passive High
Availability configurations, and the current job distribution mechanism does
not allow grouping multiple hosts in a cluster where jobs can be queued to be
processed by any of those nodes.

Jobs are currently distributed using a topic based message queue whose topic
is formed by the `volume_topic`, `scheduler_topic`, or `backup_topic` prefix
joined with the host name, and possibly the backend name if it's a
multibackend node, as in `cinder-volume.localhost@lvm`. This is the mechanism
used to send jobs to the Volume nodes regardless of the physical address of
the node that is going to handle the job, allowing an easier transition on
failover.

The chosen solution must be backward compatible as well as allow the new
Active/Active configuration to effectively send jobs.

In the Active/Active configuration there can be multiple Volume nodes - this
is not mandatory at all times, as failures may leave us with only 1 active
node - with different `host` configuration values that can interchangeably
accept jobs for the same storage backend.


Use Cases
=========

Operators that have hard requirements - SLA or other reasons - to keep their
cloud operational at all times, or that have higher throughput requirements,
will want the possibility to configure their deployments with an
Active/Active configuration.


Proposed change
===============

Basic mechanism
---------------

To provide a mechanism that allows us to distribute jobs to a group of nodes,
we'll make a slight change to the meaning of the `host` configuration option
in Cinder. Instead of just being "the name of this node", as it is now, it
will be the "logical host name that groups a set of - one or many -
homogeneous nodes under a common name. This host set will behave as one
entity to the callers".

That way the `host` configuration option will uniquely identify a group of
Volume nodes that share the same storage backends and can therefore accept
jobs for the same volumes interchangeably.

This means that `host` can now have the same value on multiple volume nodes,
with the only requisite that all nodes sharing the same value must also share
the same storage backends and configuration.

Since we are now using `host` as a logical grouping, we need some other way
to uniquely identify each node, and that's where a new configuration option
called `node_name` comes in.

By default both `node_name` and `host` will take the value of
`socket.gethostname()`. This will keep us upgrade compatible with deployments
that are currently using the `host` field to work with unsupported
Active/Active configurations.
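
As a minimal, hypothetical sketch of how these options could be defined with
oslo.config - the option names come from this spec, but the help strings and
registration location are illustrative assumptions::

    # Hypothetical option definitions; `host` already exists in Cinder,
    # `node_name` is the new option proposed by this spec.
    import socket

    from oslo_config import cfg

    service_opts = [
        cfg.StrOpt('host',
                   default=socket.gethostname(),
                   help='Logical host name that groups a set of homogeneous '
                        'nodes sharing the same storage backends.'),
        cfg.StrOpt('node_name',
                   default=socket.gethostname(),
                   help='Name that uniquely identifies this node within the '
                        'host set.'),
    ]

    cfg.CONF.register_opts(service_opts)

With these defaults a single-node deployment keeps working unchanged, since
`host` and `node_name` both resolve to the machine's host name.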

By using the same `host` configuration option that we were previously using,
we are able to keep the message queue topics as they were, ensuring not only
backward compatibility but also conformity with rolling upgrade requirements.

One benefit of this solution is that it will be very easy to change a
deployment from Single node or Active/Passive to Active/Active without any
downtime. All that is needed is to set the same `host` configuration option
that the active node has and configure the storage backends in the same way;
`node_name` must be different though.

This new mechanism will promote the host "service unit" to allow grouping,
and will add a new node unit equating to the old host unit in order to
maintain the granularity of some operations.

Heartbeats
----------

With Active/Passive configurations a storage backend service is down whenever
we don't have a valid heartbeat from the host, and it is up if we do. These
heartbeats are reported in the ``services`` table of the DB.

In Active/Active configurations a service is down if there is no valid
heartbeat from any of the nodes that constitute the host set, and it is up if
there is at least one valid heartbeat.

Since we are moving from a one-to-one relationship between hosts and services
to a many-to-one relationship, we will need to make some changes to the DB
services table, as explained in `Data model impact`_.

Each node backend will report individual heartbeats, and the scheduler will
aggregate this information, grouping by the `host` field, to know which
backends are up/down. This way we'll also be able to tell which specific
nodes are up/down based on their individual heartbeats.
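
A minimal sketch of the scheduler-side aggregation, assuming we already have
the heartbeat timestamps of every node in the host set (function and
parameter names are illustrative)::

    import datetime

    def service_is_up(node_heartbeats, window_seconds=60):
        """A host set is up if at least one of its nodes reported recently.

        :param node_heartbeats: datetime of the last heartbeat of each node.
        """
        limit = (datetime.datetime.utcnow() -
                 datetime.timedelta(seconds=window_seconds))
        return any(heartbeat >= limit for heartbeat in node_heartbeats)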

We'll also need to update the REST API to report detailed information about
the status of the different nodes in the cluster, as explained in `REST API
impact`_.

Disabling
---------

Even though we'll now have multiple nodes working under the same `host`, we
won't be changing the disabling behavior in any way. Disabling a service will
still prevent schedulers from taking that service into consideration during
filtering and weighing, and the service will still be reachable for all
operations that don't go through the scheduler.

It stands to reason that sometimes we'll need to drain nodes to remove them
from a service group, but this spec and its implementation will not be adding
any new mechanism for that, so the existing mechanism should be used to
perform a graceful shutdown of c-vol nodes.

The current graceful shutdown mechanism will make sure that no new operations
are received from the AMQP queue while it waits for ongoing operations to
complete before stopping.

It is important to remember that graceful shutdown has a timeout that will
forcefully stop operations if they take longer than the configured value. The
configuration option is called `graceful_shutdown_timeout`, goes in the
[DEFAULT] section, and takes a default value of 60 seconds; so this should be
configured in our deployments if we think this is not long enough for our use
cases.

Capabilities
------------

All Volume nodes periodically report their capabilities to the schedulers to
keep them updated with their stats, so that they can make informed decisions
on where to perform operations.

In a similar way to the service state reporting, we need to prevent
concurrent access to the data structure when updating this information.
Fortunately we are storing this information in a Python dictionary on the
schedulers, and since we are using an eventlet executor for the RPC server we
don't have to worry about using locks: the inherent behavior of the executor
will prevent concurrent access to the dictionary. So no changes are needed
there to have exclusive access to the data structure.

Although rare, we could have a consistency problem among nodes, where
different schedulers would not have the same information for a given backend.

When we had only 1 node reporting for each given backend this situation could
not happen, since the received capabilities report was always the latest and
all scheduler nodes were in sync. But now that we have multiple nodes
reporting on the same backend, we could receive two reports from different
Volume nodes for the same backend and they could be processed in a different
order on different nodes, thus making us have different data on each
scheduler.

The reason why we can't ensure that all schedulers will have the same
capabilities stored in their internal structures is that capabilities reports
can be processed in a different order on different nodes. Order is preserved
in *almost all* stages: nodes report in a specific order, the message broker
preserves this order, and reports are even delivered in the same order, but
when each node processes them the greenthreads can execute in a different
order on different nodes, thus ending up with different data on each node.

This case could probably be ignored, since it's very rare and the differences
would be small, but in the interest of consistency of the backend
capabilities on Scheduler nodes, we will timestamp the capabilities on the
Volume nodes before they are sent to the scheduler, instead of doing it on
the scheduler as we are doing now. Then we'll have schedulers drop any
capabilities report that is older than the one in the data structure.
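
A minimal sketch of the scheduler-side handling, assuming capability reports
arrive with the timestamp set by the Volume node (structure and function
names are illustrative)::

    # backend name -> {'timestamp': datetime, 'capabilities': dict}
    backend_states = {}

    def update_capabilities(backend, capabilities, timestamp):
        """Drop reports that are older than the one we already have."""
        current = backend_states.get(backend)
        if current and current['timestamp'] >= timestamp:
            return  # Stale or reordered report; a newer one was processed.
        backend_states[backend] = {'timestamp': timestamp,
                                   'capabilities': capabilities}

This makes the result independent of the order in which greenthreads process
the reports, so all schedulers converge on the newest report.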

By making this change we facilitate new features related to capability
reporting, like capability caching. Capability gathering is usually an
expensive operation, and in Active/Active configurations we'll have multiple
nodes requesting the same capabilities with the same frequency for the same
back-end, so capability caching could be a good solution to decrease the cost
of the gathering on the backend.

Alternatives
------------

One alternative to the proposed job distribution would be to leave the topic
queues as they are and move the job distribution logic to the scheduler.

The scheduler would receive a job and then send it to one of the hosts that
belongs to the same cluster and is not down.

This method has one problem: we could be sending a job to a node that is down
but whose heartbeat hasn't expired yet, or to one that goes down before
getting the job from the queue. In these cases we would end up with a job
that is not being processed by anyone, and we would need to either wait for
the node to come back up, or the scheduler would need to retrieve that
message from the queue and send it to another active node.

An alternative to the proposed heartbeats is that all services report using
`cluster@backend` instead of `host@backend` like they are doing now, and as
long as we have a valid heartbeat we know that the service is up.

There are 2 reasons why I believe that sending independent heartbeats is a
superior solution, even if we need to modify the DB tables:

- Higher information granularity: we can report not only which services are
  up/down but also which nodes are up/down.

- It will help us with job cleanup of failed nodes that do not come back up.
  Although cleanup is not part of this spec, it is good to keep it in mind
  and facilitate it as much as possible.

Another alternative for the job distribution, which was the proposed solution
in previous versions of this specification, was to add a new configuration
option called ``cluster`` instead of ``node_name`` as we propose now.

In that case message topics would change: instead of using ``host`` for them,
they would use this new ``cluster`` option, which would default to the same
value as ``host`` when undefined.

The main advantage of this alternative is that it's easier to understand the
concept of a cluster comprised of hosts than a conceptual host entity
composed of interchangeable nodes.

Unfortunately this lone advantage is easily outweighed by the overwhelming
number of disadvantages that it presents:

- Since we cannot rename DB fields directly with rolling upgrades, we would
  have to make progressive changes through 3 releases to reach the desired
  state of renaming the ``host`` field to ``cluster`` in the ``services``,
  ``consistencygroups``, ``volumes``, and ``backups`` tables.

- Renaming of RPC methods and RPC arguments would also take several releases.

- Backports would get a lot more complicated, since we would have cosmetic
  changes all over the place: in the API, RPC, DB, Scheduler, etc.

- It would introduce a concept that doesn't make much sense for the API and
  Scheduler nodes, since they are always grouped as a cluster of nodes and
  not as individual nodes.

- The risk of introducing new bugs with the ``host`` to ``cluster`` renames
  in DB fields, variables, method names, and method arguments is high.

- Even if it's not recommended, there are people who are doing HA
  Active/Active using the host name, and adding the ``cluster`` field would
  create problems for them.

Data model impact
-----------------

*Final result:*

We need to split the current ``services`` DB table (``Service`` ORM model)
into 2 different tables, ``services`` and ``service_nodes``, to support
individual heartbeats for each node of a logical service host.

Modifications to the ``services`` table will be:

- Removal of the ``report_count`` field.
- Removal of the ``modified_at`` field, since we can go back to using the
  ``updated_at`` field now that heartbeats will be reported in another table.
- Removal of the version fields ``rpc_current_version`` and
  ``object_current_version``, as they will be moved to the ``service_nodes``
  table.

The new ``service_nodes`` table will have the following fields (sketched
below):

- ``id``: Unique identifier for the service node.
- ``service_id``: Foreign key joining with the ``services`` table.
- ``name``: Primary key. Same meaning as it holds now.
- ``report_count``: Same meaning as it holds now.
- ``rpc_version``: RPC version for the service.
- ``object_version``: Versioned Objects version for the service.
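
A hypothetical SQLAlchemy sketch of the new table; column types, lengths, and
the exact key layout are illustrative, not the final migration::

    from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ServiceNode(Base):
        __tablename__ = 'service_nodes'

        id = Column(Integer, primary_key=True)
        service_id = Column(Integer, ForeignKey('services.id'),
                            nullable=False)
        name = Column(String(255), nullable=False)  # The node's `node_name`.
        report_count = Column(Integer, nullable=False, default=0)
        rpc_version = Column(String(36))
        object_version = Column(String(36))
        updated_at = Column(DateTime)  # Heartbeat timestamp for this node.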

*Intermediate steps:*

In order to support rolling upgrades we can't just drop fields in the same
upgrade that moves them to another table, so we need several steps/versions
to do so. Here are the steps that will be taken to reach the desired final
result mentioned above:

In the N release:

- Add the ``service_nodes`` table.
- All N c-vol nodes will report heartbeats in the new ``service_nodes``
  table.
- To allow coexistence of M and N nodes (necessary for rolling upgrades), the
  detection of down nodes will be done by checking both the ``services``
  table and the ``service_nodes`` table, and considering a service down only
  if both reports say it is down (see the sketch below).
- All N c-vol nodes will report their RPC and Objects versions in the
  ``service_nodes`` table.
- To allow coexistence of M and N nodes, RPC version detection will use both
  tables to determine the minimum versions that are running.
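
A minimal sketch of the N-release availability check during rolling upgrades
(helper and parameter names are illustrative)::

    def service_is_down(legacy_heartbeat_valid, node_heartbeats_valid):
        """Down only when the old and new reports agree the service is down.

        :param legacy_heartbeat_valid: bool from the ``services`` table
                                       heartbeat (reported by M nodes).
        :param node_heartbeats_valid: list of bools, one per row in the
                                      ``service_nodes`` table (N nodes).
        """
        return not (legacy_heartbeat_valid or any(node_heartbeats_valid))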

In the O release:

- Service availability checks will only use the ``service_nodes`` table.
- Minimum version detection will only use the ``service_nodes`` table.
- As a final step of the rolling upgrade, we'll drop the ``report_count``,
  ``modified_at``, ``rpc_current_version``, and ``object_current_version``
  fields from the ``services`` table, as they have been moved to the
  ``service_nodes`` table.

REST API impact
---------------

To report the status of services we will add an optional parameter called
``node_detail`` which will report the node breakdown.

When ``node_detail`` is not set we will report exactly as we are doing now,
to be backward compatible with clients, so we'll report that the service is
up if *any* of the nodes forming the host set is up, and that it is disabled
if the service is globally disabled or all the nodes of the service are
disabled.

When ``node_detail`` is set to False, which tells us the client knows about
the new API options, we'll return new fields:

- ``nodes``: Number of nodes that form the host set.
- ``down_nodes``: Number of nodes that are down in the host set.
- ``last_heartbeat``: Last heartbeat from any node.

If ``node_detail`` is set to True, we'll return a field called ``nodes``
that will contain a list of dictionaries with ``name``, ``status``, and
``heartbeat`` keys containing individual values for each of the nodes of the
host set, as in the sketch below.
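
A hypothetical example of a service entry with ``node_detail=True``, shown as
a Python literal; the field values are illustrative::

    service_entry = {
        'binary': 'cinder-volume',
        'host': 'cluster1@lvm',
        'status': 'enabled',
        'state': 'up',
        'nodes': [
            {'name': 'node1', 'status': 'enabled',
             'heartbeat': '2016-06-01T10:00:10.000000'},
            {'name': 'node2', 'status': 'disabled',
             'heartbeat': '2016-06-01T10:00:07.000000'},
        ],
    }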

Changes will be backward compatible with XML responses, but the new
functionality will not work with XML responses.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Negligible, if we implement the aggregation of the heartbeats in a SQL query
using EXISTS instead of retrieving all heartbeats and doing the aggregation
on the scheduler, as sketched below.
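
A hypothetical sketch of the EXISTS-based check with SQLAlchemy; model and
column names are illustrative (see `Data model impact`_)::

    import datetime

    from sqlalchemy import Column, DateTime, Integer
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ServiceNode(Base):
        __tablename__ = 'service_nodes'
        id = Column(Integer, primary_key=True)
        service_id = Column(Integer, nullable=False)
        updated_at = Column(DateTime)

    def service_is_up(session, service_id, window_seconds=60):
        limit = (datetime.datetime.utcnow() -
                 datetime.timedelta(seconds=window_seconds))
        # The DB can stop scanning as soon as one live heartbeat is found,
        # instead of returning every node's heartbeat to the scheduler.
        return session.query(
            session.query(ServiceNode)
            .filter(ServiceNode.service_id == service_id,
                    ServiceNode.updated_at >= limit)
            .exists()
        ).scalar()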

Other deployer impact
---------------------

None

Developer impact
----------------

None


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)
  Scott DAngelo (scottda)
  Anyone is welcome to help

Work Items
----------

- Add the new ``service_nodes`` table.

- Add the `node_name` configuration option.

- Modify the Scheduler code to aggregate the different heartbeats.

- Change the c-vol heartbeat mechanism.

- Change the API's service index response as well as the update.

- Update cinder-client to support the new service listing.

- Update the manage client service commands.


Dependencies
============

None


Testing
=======

Unit tests for the new API behavior.


Documentation Impact
====================

This spec has changes to the API as well as a new configuration option that
will need to be documented.


References
==========

None