Update Job Distribution for A/A Specs

During the Austin Summit it was decided that we should change the approved
spec's approach to Job Distribution for Active-Active configurations. This
patch updates the spec to use the concept of a cluster, instead of using the
host and adding the concept of nodes, as agreed during the design session. It
also moves the spec to Ocata to make sure there's no confusion on whether this
is implemented or not.

Change-Id: I5f8194a52ff249d85fdf4ac9cfb184a9f33e1ccc
Implements: blueprint cinder-volume-active-active-support

This commit is contained in:
parent b4baff9915
commit a209a64c16

@@ -1,434 +0,0 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Job Distribution
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can only run in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end.

This spec introduces a slight modification to the `host` concept in Cinder to
include the concept of a cluster as well, and provides a way for Cinder's API
and Scheduler nodes to distribute jobs to Volume nodes on a High Availability
deployment with an Active/Active configuration.


Problem description
===================

Right now the cinder-volume service only accepts Active/Passive High
Availability configurations, and the current job distribution mechanism does
not allow having multiple hosts grouped in a cluster where jobs can be queued
to be processed by any of those nodes.

Jobs are currently distributed using a topic based message queue identified by
the `volume_topic`, `scheduler_topic`, or `backup_topic` prefix joined with
the host name, and possibly the backend name if it's a multibackend node, as
in `cinder-volume.localhost@lvm`. That is the mechanism used to send jobs to
the Volume nodes regardless of the physical address of the node that is going
to handle the job, allowing an easier transition on failover.
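The topic naming above can be sketched as a small helper; this is an
illustrative reconstruction of the naming scheme the spec describes, not
Cinder's actual implementation:

```python
def rpc_topic(service, host, backend=None):
    """Build the per-service topic queue name described above.

    For a multibackend node the backend name is appended after '@',
    e.g. 'cinder-volume.localhost@lvm'.
    """
    target = "%s@%s" % (host, backend) if backend else host
    return "%s.%s" % (service, target)
```

Any node consuming from that topic can pick up the job, which is what makes
failover transparent to the callers.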
The chosen solution must be backward compatible as well as allow the new
Active/Active configuration to effectively send jobs.

In the Active/Active configuration there can be multiple Volume nodes - this
is not mandatory at all times, as failures may leave us with only 1 active
node - with different `host` configuration values that can interchangeably
accept jobs that are handling the same storage backend.


Use Cases
=========

Operators that have hard requirements (SLA or other reasons) to keep their
cloud operational at all times, or that have higher throughput requirements,
will want the possibility to configure their deployments in an Active/Active
configuration.
Proposed change
===============

Basic mechanism
---------------

To provide a mechanism that allows us to distribute jobs to a group of nodes,
we'll make a slight change to the meaning of the `host` configuration option
in Cinder. Instead of just being "the name of this node" as it is now, it will
be the "logical host name that groups a set of - one or many - homogeneous
nodes under a common name. This host set will behave as one entity to the
callers".

That way the `host` configuration option will uniquely identify a group of
Volume nodes that share the same storage backends and can therefore accept
jobs for the same volumes interchangeably.

This means that `host` can now have the same value on multiple volume nodes,
with the only requirement that all nodes sharing the same value must also
share the same storage backends and configuration.

Since we are now using `host` as a logical grouping, we need some other way to
uniquely identify each node, and that's where a new configuration option
called `node_name` comes in.

By default both `node_name` and `host` will take the value of
`socket.gethostname()`; this will keep upgrades compatible with deployments
that are currently using the `host` field to work with unsupported
Active/Active configurations.

By using the same `host` configuration option that we were previously using,
we are able to keep the message queue topics as they were, ensuring not only
backward compatibility but also conformity with rolling upgrade requirements.

One benefit of this solution is that it will be very easy to change a
deployment from Single node or Active/Passive to Active/Active without any
downtime. All that is needed is to configure the same `host` configuration
option that the active node has and configure the storage backends in the same
way; `node_name` must be different though.

This new mechanism will promote the host "service unit" to allow grouping, and
will add a new node unit that equates to the old host unit in order to
maintain the granularity of some operations.
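The defaulting behavior described above can be sketched as follows; the
function name and signature are hypothetical, only the
`socket.gethostname()` default comes from the spec:

```python
import socket

# Both options fall back to the node's hostname, so existing single-node
# deployments keep their current identity unchanged after an upgrade.
DEFAULT = socket.gethostname()

def effective_identifiers(host=None, node_name=None):
    """Return the (logical host, unique node name) pair a service would use."""
    return (host or DEFAULT, node_name or DEFAULT)
```

In an Active/Active deployment every node would set the same `host` value but
a distinct `node_name`.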
Heartbeats
----------

With Active/Passive configurations a storage backend service is down whenever
we don't have a valid heartbeat from the host, and up if we do. These
heartbeats are reported in the DB in the ``services`` table.

In Active/Active configurations a service is down if there is no valid
heartbeat from any of the nodes that constitute the host set, and it is up if
there is at least one valid heartbeat.

Since we are moving from a one-to-one relationship between hosts and services
to a many-to-one relationship, we will need to make some changes to the DB
services table, as explained in `Data model impact`_.

Each node backend will report individual heartbeats, and the scheduler will
aggregate this information, grouping by the `host` field, to know which
backends are up/down. This way we'll also be able to tell which specific nodes
are up/down based on their individual heartbeats.

We'll also need to update the REST API to report detailed information on the
status of the different nodes in the cluster, as explained in `REST API
impact`_.
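The "up if at least one heartbeat is valid" rule can be sketched like this;
the function and the timeout value are illustrative assumptions, not Cinder's
real configuration:

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value, not a real Cinder option

def service_is_up(node_heartbeats, now=None):
    """A host set is up if at least one node reported a recent heartbeat.

    node_heartbeats maps node_name -> last heartbeat timestamp (epoch secs).
    """
    now = now if now is not None else time.time()
    return any(now - ts <= HEARTBEAT_TIMEOUT for ts in node_heartbeats.values())
```

The per-node timestamps also give the scheduler the node-level up/down
breakdown mentioned above for free.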
Disabling
---------

Even though we'll now have multiple nodes working under the same `host`, we
won't be changing the disabling behavior in any way. Disabling a service will
still prevent schedulers from taking that service into consideration during
filtering and weighting, and the service will still be reachable for all
operations that don't go through the scheduler.

It stands to reason that sometimes we'll need to drain nodes to remove them
from a service group, but this spec and its implementation will not be adding
any new mechanism for that. So the existing mechanism should be used to
perform a graceful shutdown of c-vol nodes.

The current graceful shutdown mechanism will make sure that no new operations
are received from the AMQP queue while it waits for ongoing operations to
complete before stopping.

It is important to remember that graceful shutdown has a timeout that will
forcefully stop operations if they take longer than the configured value. The
configuration option is called `graceful_shutdown_timeout`, goes in the
[DEFAULT] section, and takes a default value of 60 seconds; this should be
configured in our deployments if we think it is not long enough for our use
cases.
Capabilities
------------

All Volume nodes periodically report their capabilities to the schedulers to
keep them updated with their stats, so that they can make informed decisions
on where to perform operations.

In a similar way to Service state reporting, we need to prevent concurrent
access to the data structure when updating this information. Fortunately we
are storing this information in a Python dictionary on the schedulers, and
since we are using an eventlet executor for the RPC server we don't have to
worry about using locks; the inherent behavior of the executor will prevent
concurrent access to the dictionary. So no changes are needed there to have
exclusive access to the data structure.

Although rare, we could have a consistency problem among nodes where different
schedulers would not have the same information for a given backend.

When we had only 1 node reporting for each given backend this situation could
not happen, since the received capabilities report was always the latest and
all scheduler nodes were in sync. But now that we have multiple nodes
reporting on the same backend, we could receive two reports from different
Volume nodes on the same backend, and they could be processed in a different
order on different nodes, thus making us have different data on each
scheduler.

The reason why we can't ensure that all schedulers will have the same
capabilities stored in their internal structures is that capabilities reports
can be processed in a different order on different nodes. Order is preserved
in *almost all* stages - nodes report in a specific order, the message broker
preserves this order, and reports are even delivered in the same order - but
when each node processes them we can have greenthreads executing in a
different order on different nodes, thus ending up with different data on each
node.

This case could probably be ignored since it's very rare and the differences
would be small, but in the interest of consistency of the backend capabilities
on Scheduler nodes, we will timestamp the capabilities on the Volume nodes
before they are sent to the scheduler, instead of doing it on the scheduler as
we are doing now. Then we'll have the schedulers drop any capabilities report
that is older than the one in the data structure.

By making this change we facilitate new features related to capability
reporting, like capability caching. Capability gathering is usually an
expensive operation, and in Active-Active configurations we'll have multiple
nodes requesting the same capabilities with the same frequency for the same
back-end, so capability caching could be a good solution to decrease the cost
of the gathering on the backend.
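The timestamp-gating rule can be sketched as follows; the function name and
report structure are hypothetical, only the "drop anything older than what we
already have" behavior comes from the spec:

```python
# Each scheduler keeps only the newest report per backend, so all schedulers
# converge on the same data regardless of the order in which reports from
# different Volume nodes are processed.
def update_capabilities(store, backend, report):
    """store maps backend -> report; 'timestamp' is set by the Volume node
    before sending, not by the scheduler on receipt."""
    current = store.get(backend)
    if current is None or report["timestamp"] > current["timestamp"]:
        store[backend] = report
        return True
    return False  # stale report, dropped
```

Because the ordering decision now depends only on the sender-side timestamp,
greenthread scheduling differences on the receivers no longer matter.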
Alternatives
------------

One alternative to the proposed job distribution would be to leave the topic
queues as they are and move the job distribution logic to the scheduler.

The scheduler would receive a job and then send it to one of the hosts that
belongs to the same cluster and is not down.

This method has one problem: we could be sending a job to a node that is down
but whose heartbeat hasn't expired yet, or to one that goes down before
getting the job from the queue. In these cases we would end up with a job that
is not being processed by anyone, and we would need to either wait for the
node to come back up or have the scheduler retrieve that message from the
queue and send it to another active node.

An alternative to the proposed heartbeats is that all services report using
`cluster@backend` instead of `host@backend` as they are doing now, and as long
as we have a valid heartbeat we know that the service is up.

There are 2 reasons why I believe that sending independent heartbeats is a
superior solution, even if we need to modify the DB tables:

- Higher information granularity: we can report not only which services are
  up/down but also which nodes are up/down.

- It will help us with job cleanup of failed nodes that do not come back up.
  Although cleanup is not part of this spec, it is good to keep it in mind and
  facilitate it as much as possible.

Another alternative for the job distribution, which was the proposed solution
in previous versions of this specification, was to add a new configuration
option called ``cluster`` instead of ``node`` as we propose now.

In that case message topics would change: instead of using ``host`` for them,
they would use this new ``cluster`` option, which would default to the same
value as ``host`` when undefined.

The main advantage of this alternative is that it's easier to understand the
concept of a cluster comprised of hosts than a conceptual host entity composed
of interchangeable nodes.

Unfortunately this lone advantage is easily outweighed by the overwhelming
number of disadvantages that it presents:

- Since we cannot rename DB fields directly with rolling upgrades, we would
  have to make progressive changes through 3 releases to reach the desired
  state of renaming the ``host`` field to ``cluster`` in the ``services``,
  ``consistencygroups``, ``volumes``, and ``backups`` tables.

- Renaming RPC methods and RPC arguments would also take several releases.

- Backports would get a lot more complicated since we would have cosmetic
  changes all over the place: in the API, RPC, DB, Scheduler, etc.

- It would introduce a concept that doesn't make much sense on the API and
  Scheduler nodes, since they are always grouped as a cluster of nodes and
  never addressed as individual nodes.

- The risk of introducing new bugs with the ``host`` to ``cluster`` renames in
  DB fields, variables, method names, and method arguments is high.

- Even if it's not recommended, there are people doing HA Active-Active using
  the host name, and adding the ``cluster`` field would create problems for
  them.
Data model impact
-----------------

*Final result:*

We need to split the current ``services`` DB table (``Service`` ORM model)
into 2 different tables, ``services`` and ``service_nodes``, to support
individual heartbeats for each node of a logical service host.

Modifications to the ``services`` table will be:

- Removal of the ``report_count`` field.
- Removal of the ``modified_at`` field, since we can go back to using the
  ``updated_at`` field now that heartbeats will be reported in another table.
- Removal of the version fields ``rpc_current_version`` and
  ``object_current_version``, as they will be moved to the ``service_nodes``
  table.

The new ``service_nodes`` table will have the following fields:

- ``id``: Unique identifier for the service node.
- ``service_id``: Foreign key that joins with the ``services`` table.
- ``name``: Primary key. Same meaning as it holds now.
- ``report_count``: Same meaning as it holds now.
- ``rpc_version``: RPC version for the service.
- ``object_version``: Versioned Objects version for the service.

*Intermediate steps:*

In order to support rolling upgrades we can't just drop fields in the same
upgrade in which we move them to another table, so we need several
steps/versions to do so. Here are the steps that will be taken to reach the
desired final result described above:

In the N release:

- Add the ``service_nodes`` table.
- All N c-vol nodes will report heartbeats in the new ``service_nodes`` table.
- To allow the coexistence of M and N nodes (necessary for rolling upgrades),
  the detection of down nodes will be done by checking both the ``services``
  table and the ``service_nodes`` table, and considering a service down only
  if both reports say it is down.
- All N c-vol nodes will report their RPC and Objects versions in the
  ``service_nodes`` table.
- To allow the coexistence of M and N nodes, RPC version detection will use
  both tables to determine the minimum versions that are running.

In the O release:

- Service availability will be determined by checking only the
  ``service_nodes`` table.
- Minimum version detection will be done by checking only the
  ``service_nodes`` table.
- As a final step of the rolling upgrade we'll drop the ``report_count``,
  ``modified_at``, ``rpc_current_version``, and ``object_current_version``
  fields from the ``services`` table, as they have been moved to the
  ``service_nodes`` table.
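The final table split can be sketched as DDL; the column names follow the
spec, while the types and the extra ``services`` columns shown are
illustrative assumptions (SQLite is used only to keep the sketch runnable):

```python
import sqlite3

# Hypothetical sketch of the services / service_nodes split described above.
DDL = """
CREATE TABLE services (
    id INTEGER PRIMARY KEY,
    host VARCHAR(255),        -- logical host grouping the node set
    "binary" VARCHAR(255),    -- e.g. 'cinder-volume'
    disabled BOOLEAN DEFAULT 0
);
CREATE TABLE service_nodes (
    id INTEGER PRIMARY KEY,
    service_id INTEGER REFERENCES services(id),
    name VARCHAR(255),        -- per-node identifier
    report_count INTEGER DEFAULT 0,
    rpc_version VARCHAR(36),
    object_version VARCHAR(36),
    updated_at TIMESTAMP      -- individual heartbeat
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Each node then owns one ``service_nodes`` row for its heartbeat and versions,
while the shared ``services`` row represents the logical host.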
REST API impact
---------------

To report the status of services we will add an optional parameter called
``node_detail`` which will report the node breakdown.

When ``node_detail`` is not set we will report exactly as we do now, to be
backward compatible with clients: we'll report the service as up if *any* of
the nodes forming the host set is up, and report it as disabled if the service
is globally disabled or all the nodes of the service are disabled.

When ``node_detail`` is set to False, which tells us the client knows of the
new API options, we'll return new fields:

- ``nodes``: Number of nodes that form the host set.
- ``down_nodes``: Number of nodes that are down in the host set.
- ``last_heartbeat``: Last heartbeat from any node.

If ``node_detail`` is set to True, we'll return a field called ``nodes`` that
will contain a list of dictionaries with ``name``, ``status``, and
``heartbeat`` keys holding individual values for each of the nodes of the host
set.

Changes will be backward compatible with XML responses, but the new
functionality will not work with XML responses.
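The two response shapes could look roughly like this; the field names follow
the spec, while the surrounding fields and all values are made-up examples:

```python
# node_detail=False: aggregated counters for the whole host set.
summary = {
    "binary": "cinder-volume",
    "host": "storage-group-1@lvm",
    "state": "up",
    "nodes": 2,
    "down_nodes": 1,
    "last_heartbeat": "2016-05-30T12:00:00",
}

# node_detail=True: per-node breakdown in a 'nodes' list.
detailed = {
    "binary": "cinder-volume",
    "host": "storage-group-1@lvm",
    "nodes": [
        {"name": "node-01", "status": "up", "heartbeat": "2016-05-30T12:00:00"},
        {"name": "node-02", "status": "down", "heartbeat": "2016-05-30T11:50:00"},
    ],
}
```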
Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Negligible if we implement the aggregation of the heartbeats in a SQL query
using ``EXISTS``, instead of retrieving all heartbeats and doing the
aggregation on the scheduler.

Other deployer impact
---------------------

None

Developer impact
----------------

None
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)
  Scott DAngelo (scottda)
  Anyone is welcome to help

Work Items
----------

- Add the new ``service_nodes`` table.

- Add the `node_name` configuration option.

- Modify the Scheduler code to aggregate the different heartbeats.

- Change the c-vol heartbeat mechanism.

- Change the API's service index response as well as the update.

- Update cinder-client to support the new service listing.

- Update the client's service management commands.


Dependencies
============

None


Testing
=======

Unit tests for the new API behavior.


Documentation Impact
====================

This spec has changes to the API as well as a new configuration option that
will need to be documented.


References
==========

None
specs/ocata/ha-aa-job-distribution.rst (new file, 384 lines)

@@ -0,0 +1,384 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Job Distribution
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can only run in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end.

This spec introduces the concept of a cluster to Cinder and aims to provide a
way for Cinder's API and Scheduler nodes to distribute jobs to Volume nodes on
a High Availability deployment with an Active/Active configuration.


Problem description
===================

Right now the cinder-volume service only accepts Active/Passive High
Availability configurations, and the current job distribution mechanism does
not allow having multiple services grouped in a cluster where jobs can be
queued to be processed by any of those nodes.

Jobs are currently distributed using a topic based message queue identified by
the `volume_topic`, `scheduler_topic`, or `backup_topic` prefix joined with
the host name, and possibly the backend name if it's a multibackend node, as
in `cinder-volume.localhost@lvm`. That is the mechanism used to send jobs to
the Volume nodes regardless of the physical address of the node that is going
to handle the job, allowing an easier transition on failover.
The chosen solution must be backward compatible as well as allow the new
Active/Active configuration to effectively send jobs.

In the Active/Active configuration there can be multiple Volume services -
this is not mandatory at all times, as failures may leave us with only 1
active service - with different `host` configuration values that can
interchangeably accept jobs that are handling the same storage backend.


Use Cases
=========

Operators that have hard requirements (SLA or other reasons) to keep their
cloud operational at all times, or that have higher throughput requirements,
will want the possibility to configure their deployments in an Active/Active
configuration.
Proposed change
===============

Basic mechanism
---------------

To provide a mechanism that allows us to distribute jobs to a group of nodes,
we'll add a new `cluster` configuration option that will uniquely identify a
group of Volume nodes that share the same storage backends and can therefore
accept jobs for the same volumes interchangeably.

This new configuration option, unlike the `host` option, will be allowed to
have the same value on multiple volume nodes, with the only requirement that
all nodes sharing the same value must also share the same storage backends
and the same configurations.

By default the `cluster` configuration option will be undefined, but when a
string value is given, a new topic queue will be created on the message broker
to distribute jobs meant for that cluster, in the form of
`cinder-volume.cluster@backend`, similar to the already existing host topic
queue `cinder-volume.host@backend`.

It is important to notice that the `cluster` configuration option is not a
replacement of the `host` option: both will coexist within the service, and
both must exist for Active-Active configurations.

To be able to determine the topic queue where an RPC caller has to send
operations, we'll add a `cluster_name` field to any resource DB table that
currently has the `host` field we are using for non Active/Active
configurations. This way we don't need to check the DB, or keep a cache in
memory, to figure out in which cluster a service is included, if it is in a
cluster at all.

Once the basic mechanism of receiving RPC calls on the cluster topic queue is
in place, operations will be incrementally moved to support Active-Active if
the resource is in a cluster, as indicated by the presence of a value in the
`cluster_name` resource field.

The reason behind this progressive approach, instead of an all or nothing
approach, is to reduce the possibility of adding new bugs and to facilitate
quick fixes by just reverting a specific patch.

This solution makes a clear distinction between independent services and
those that belong to a cluster, and the same can be said about resources
belonging to a cluster.

To facilitate the inclusion of a service in a cluster, the volume manager
will detect when the `cluster` value has changed from being undefined to
having a value, and proceed to include all existing resources in the cluster
by filling in the `cluster_name` fields.

Having both message queues, one for the cluster and one for the service,
could prove useful in the future if we want to add operations that can target
specific services within a cluster.
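The routing decision driven by the `cluster_name` field can be sketched as
follows; the function name and the resource dictionary are assumptions made
for illustration, only the fallback rule comes from the spec:

```python
# A resource that carries a cluster_name is addressed via the cluster topic
# queue; otherwise the caller falls back to the per-host topic, so clustered
# and independent services coexist on the same deployment.
def volume_rpc_topic(resource):
    target = resource.get("cluster_name") or resource["host"]
    return "cinder-volume.%s" % target
```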
Heartbeats
----------

With Active/Passive configurations a storage backend service is down whenever
we don't have a valid heartbeat from the service, and up if we do. These
heartbeats are reported in the DB in the ``services`` table.

In Active/Active configurations a service is down if there is no valid
heartbeat from any of the services that constitute the cluster, and it is up
if there is at least one valid heartbeat.

Services will keep reporting their heartbeats in the same way that they do
now, and it will be the Scheduler's job to distinguish between individual and
clustered services and aggregate the latter by cluster name.

As explained in `REST API impact`_, the API will be able to show cluster
information with the status - up or down - of each cluster, based on the
services that belong to it.
Disabling
---------

This new mechanism will change the "disabling working unit" from service to
cluster for services that are in a cluster. This means that once all
operations that go through the scheduler have been moved to support
Active-Active configurations, we won't be able to disable an individual
service belonging to a cluster; we'll have to disable the cluster itself. For
non clustered services, disabling will work as usual.

Disabling a cluster will prevent schedulers from taking that cluster, and
therefore all its services, into consideration during filtering and
weighting, while the services will still be reachable for all operations that
don't go through the scheduler.

It stands to reason that sometimes we'll need to drain nodes to remove them
from a cluster, but this spec and its implementation will not be adding any
new mechanism for that. So the existing mechanism, using SIGTERM, should be
used to perform a graceful shutdown of cinder volume services.

The current graceful shutdown mechanism will make sure that no new operations
are received from the messaging queue while it waits for ongoing operations
to complete before stopping.

It is important to remember that graceful shutdown has a timeout that will
forcefully stop operations if they take longer than the configured value. The
configuration option is called `graceful_shutdown_timeout`, goes in the
[DEFAULT] section, and takes a default value of 60 seconds; this should be
configured in our deployments if we think it is not long enough for our use
cases.
Capabilities
------------

All Volume services periodically report their capabilities to the schedulers
to keep them updated with their stats, so that they can make informed
decisions on where to perform operations.

In a similar way to Service state reporting, we need to prevent concurrent
access to the data structure when updating this information. Fortunately we
are storing this information in a Python dictionary on the schedulers, and
since we are using an eventlet executor for the RPC server we don't have to
worry about using locks; the inherent behavior of the executor will prevent
concurrent access to the dictionary. So no changes are needed there to have
exclusive access to the data structure.

Although rare, we could have a consistency problem among volume services
where different schedulers would not have the same information for a given
backend.

When we had only 1 volume service reporting for each given backend this
situation could not happen, since the received capabilities report was always
the latest and all scheduler services were in sync. But now that we have
multiple volume services reporting on the same backend, we could receive two
reports from different volume services on the same backend, and they could be
processed in a different order on different schedulers, thus making us have
different data on each scheduler.

The reason why we can't ensure that all schedulers will have the same
capabilities stored in their internal structures is that capabilities reports
can be processed in a different order on different services. Order is
preserved in *almost all* stages - volume services report in a specific
order, the message broker preserves this order, and reports are even
delivered in the same order - but when each service processes them we can
have greenthreads executing in a different order on different scheduler
services, thus ending up with different data on each service.

This case could probably be ignored since it's very rare and the differences
would be small, but in the interest of consistency of the backend
capabilities on Scheduler services, we will timestamp the capabilities on the
volume services before they are sent to the scheduler, instead of doing it on
the scheduler as we are doing now. Then we'll have the schedulers drop any
capabilities report that is older than the one in the data structure.

By making this change we facilitate new features related to capability
reporting, like capability caching. Since capability gathering is usually an
expensive operation, and in Active-Active configurations we'll have multiple
volume services requesting the same capabilities with the same frequency for
the same back-end, we could consider capability caching as a solution to
decrease the cost of the gathering on the backend.

Alternatives
------------

One alternative to the proposed job distribution would be to leave the topic
queues as they are and move the job distribution logic to the scheduler.

The scheduler would receive a job and then send it to one of the volume
services that belongs to the same cluster and is not down.

This method has one problem: we could be sending a job to a node that is down
but whose heartbeat hasn't expired yet, or to one that has gone down before
getting the job from the queue. In these cases we would end up with a job that
is not being processed by anyone, and we would need to either wait for the node
to come back up or have the scheduler retrieve that message from the queue and
send it to another active node.

An alternative to the proposed heartbeats is that all services report using
`cluster@backend` instead of `host@backend` like they are doing now, and as
long as we have a valid heartbeat we know that the service is up.

There are 2 reasons why I believe that sending independent heartbeats is a
superior solution, even if we need to modify the DB tables:

- Higher information granularity: We can report not only which services are
  up/down but also which nodes are up/down.

- It will help us with job cleanup of failed nodes that do not come back up.
  Although cleanup is not part of this spec, it is good to keep it in mind and
  facilitate it as much as possible.

Another alternative for the job distribution, which was the proposed solution
in previous versions of this specification, was to use the `host` configuration
option as the equivalent of the `cluster` grouping, together with a newly added
`node` configuration option that would serve to identify individual nodes.

Such a solution could lead to misunderstandings between the concepts of hosts
and clusters, whereas using the cluster concept directly avoids that problem.
It also wouldn't allow a progressive solution, as it was a one-shot change, and
we couldn't send messages to individual volume services since we only had the
host message topic queue.

There is a series of patches showing the implementation of the
`node alternative mechanism`_ that can serve as a more detailed explanation.

Another possibility would be to allow disabling individual services within a
cluster instead of having to disable the whole cluster, and this is something
we can take up after everything else is done. To do this we would use the
normal host message queue on the cinder-volume service to receive the
enable/disable of the cluster on the manager, and that would trigger a
start/stop of the cluster topic queue. But this is not trivial, as it requires
us to be able to stop and start the client for the cluster topic from the
cinder-volume manager (it is managed at the service level) and to be able to
wait for a full stop before we can accept a new enable request to start the
message client again.

Data model impact
-----------------

*Final result:*

A new ``clusters`` table will be added with the following fields:

- ``id``: Unique identifier for the cluster.

- ``name``: Name of the cluster; it is used to build the topic queue in the
  same way the `host` configuration option is used. This comes from the
  `cluster` configuration option.

- ``binary``: For now it will always be "cinder-volume", but when we add
  backups it'll also accept "cinder-backup".

- ``disabled``: To support disabling clusters.

- ``disabled_reason``: Same as in the services table.

- ``race_preventer``: This field will be used to prevent potential races that
  could happen if 2 new services are brought up at the same time and both try
  to create the cluster entry at the same time.

A ``cluster_name`` field will be added to the existing ``services``,
``volumes``, and ``consistencygroups`` tables.
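
A rough SQLAlchemy sketch of the table described above (column types and
lengths are assumptions; the actual migration may differ):

```python
from sqlalchemy import Boolean, Column, Integer, MetaData, String, Table

metadata = MetaData()

# Sketch of the new ``clusters`` table; types/lengths are assumptions.
clusters = Table(
    'clusters', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(255), nullable=False),
    # "cinder-volume" for now; "cinder-backup" once backups are added.
    Column('binary', String(36), nullable=False),
    Column('disabled', Boolean, default=False),
    Column('disabled_reason', String(255)),
    # Combined with a unique constraint, this guards against two services
    # racing to create the same cluster row at startup.
    Column('race_preventer', Integer, nullable=False, default=0),
)

# The ``cluster_name`` column added to services/volumes/consistencygroups
# would be a nullable String(255) referencing clusters.name.
```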

REST API impact
---------------

Service listing will return the ``cluster_name`` field when requested with the
appropriate microversion.

A new ``clusters`` endpoint will be added with list (detailed and summarized),
show, and update operations, each with its respective policy.
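
Purely as an illustration, a summarized cluster listing could have a shape
along these lines (field names beyond those in the data model section are
assumptions, not the final API):

```python
# Hypothetical shape of a summarized ``clusters`` list response; only
# ``name``, ``binary``, and the disabled flag come from the data model
# section, the rest is illustrative.
cluster_list = {
    'clusters': [
        {
            'name': 'mycluster@lvmdriver-1',
            'binary': 'cinder-volume',
            'disabled': False,
        },
    ]
}

# A detailed listing would additionally expose ``disabled_reason`` and
# per-cluster service information.
```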

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Negligible if we implement the aggregation of the heartbeats in a SQL query
using EXISTS, instead of retrieving all heartbeats and doing the aggregation on
the scheduler.
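
A sketch of that EXISTS query with SQLAlchemy (the table/column names here are
assumptions based on the data model section, not the actual Cinder DB layer):

```python
from sqlalchemy import (Boolean, Column, DateTime, MetaData, String, Table,
                        exists, select)

metadata = MetaData()

# Minimal stand-in for the services table: only the columns this query
# needs (names are assumptions based on the data model section).
services = Table(
    'services', metadata,
    Column('cluster_name', String(255)),
    Column('updated_at', DateTime),
    Column('disabled', Boolean),
)

def cluster_up_query(cluster_name, expiry):
    """Build a single EXISTS query answering "is any service in this
    cluster up?", instead of fetching every heartbeat row and
    aggregating on the scheduler."""
    return select(exists().where(
        (services.c.cluster_name == cluster_name)
        & (services.c.updated_at > expiry)   # heartbeat not yet expired
        & (services.c.disabled.is_(False))
    ))
```

The database then returns a single boolean per cluster, which keeps the cost
on the scheduler side negligible.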

Other deployer impact
---------------------

None

Developer impact
----------------

None


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)

  Scott DAngelo (scottda)

  Anyone is welcome to help

Work Items
----------

- Add the new ``clusters`` table and related operations.

- Add the Cluster Versioned Object.

- Modify job distribution to use the new ``cluster`` configuration option.

- Update the service API and add the new ``clusters`` endpoint.

- Update cinder-client to support the new endpoint and the new field in
  services.

- Move operations to Active-Active.

Dependencies
============

None


Testing
=======

Unit tests for the new API behavior.


Documentation Impact
====================

This spec has changes to the API as well as a new configuration option that
will need to be documented.


References
==========

None

.. _`node alternative mechanism`: https://review.openstack.org/286599