..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Cleanup
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can only run in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end, and we assume only one volume service
can access a specific storage back-end.

Given this premise, current code handles the cleanup for failed volume services
as if no other service were working with resources from their back-end, and
that is problematic when there are other volume services working with those
resources, as is the case in an Active/Active configuration.

This spec introduces a new cleanup mechanism and modifies the current cleanup
mechanism so proper cleanup is done regardless of the cinder configuration,
Active/Passive or Active/Active.


Problem description
===================

Current Cinder code only supports Active/Passive configurations, so the cleanup
takes that into account and cleans up resources from ongoing operations
accordingly, but that is incompatible with an Active/Active deployment.

The incompatibility comes from the fact that volume services on startup look in
the DB for resources that are in the middle of an operation and belong to their
own storage back-end - detected by the ``host`` field - and proceed to clean
them up depending on the state they are in. For example a ``downloading``
volume will be changed to ``error`` since the download was interrupted and we
cannot recover from it.

With the new job distribution mechanism the ``host`` field will contain the
host configuration of the volume service that created the resource, but that
resource may now be in use by another volume service from the same cluster, so
we cannot just rely on this ``host`` field for cleanup, as it may lead to
cleaning the wrong resources or skipping the ones we should be cleaning.

When we are working with an Active/Active system we cannot just clean all
resources from our storage back-end that are in an ongoing state, since they
may be legitimate ongoing jobs being handled by other volume services.

We are going to forget for a moment how we are doing the cleanup right now and
focus on the different cleanup scenarios we have to cover. One is when a
volume service "dies" -by that we mean that it really stops working, or it is
fenced- and failover boots another volume service to replace it as if it were
the same service -having the same ``host`` and ``cluster`` configurations-, and
the other scenario is when the service dies and no other service takes its
place, or the service that takes its place shares the ``cluster`` configuration
but has a different ``host``.

Those are the cases we have to solve to be able to support Active/Active and
Active/Passive configurations with proper cleanups.


Use Cases
=========

Operators that have hard requirements -SLA or other reasons- to keep their
cloud operational at all times, or that have higher throughput requirements,
will want the possibility to configure their deployments in an Active/Active
configuration and have proper cleanup of resources when services die.


Proposed change
===============

Since checking the status and the ``host`` field of the resource is no longer
enough to know if it needs cleanup -because the ``host`` field will be
referring to the ``host`` configuration of the volume service that created the
resource and not the current owner of the resource, as explained in the `Job
Distribution`_ spec- we will create a new table to track which service from
the cluster is working on each resource.

We'll call this new table ``workers`` and it will include all resources that
are being processed by cleanable operations, and that therefore would require
cleanup if the service doing the operation crashed.

When a cleanable job is requested by the API or any of the services -for
example a volume deletion can be requested by the API or by the c-vol service
during a migration- we will create a new row in the ``workers`` table with the
resource we are working on and who is working on it. Once the operation has
completed -successfully or unsuccessfully- this row will be deleted to
indicate that processing has concluded and a cleanup will no longer be needed
if the service dies.
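
The lifecycle of these rows could be handled with a small helper that flags
the resource when the cleanable operation starts and removes the flag when it
finishes. The following is a minimal sketch only; ``worker_create`` and
``worker_destroy`` are illustrative names for the new DB API calls, not the
final implementation::

    from contextlib import contextmanager

    from cinder import db


    @contextmanager
    def cleanable_operation(context, resource, status, service_id):
        # Flag the resource as being worked on by this service.
        worker = db.worker_create(context,
                                  resource_type=resource.obj_name(),
                                  resource_id=resource.id,
                                  status=status,
                                  service_id=service_id)
        try:
            yield worker
        finally:
            # Whether the operation succeeded or failed, processing has
            # concluded, so the flag is removed and no cleanup will be
            # needed for this resource if the service dies now.
            db.worker_destroy(context, id=worker.id)
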

We will not be adding a row for non-cleanable operations, nor for resources
that are used in cleanable operations but won't require cleanup, as this would
create a significant increase in DB operations that would end up affecting the
performance of all operations.

These ``workers`` rows serve as *flags* for the cleanup mechanism to know it
must check that resource in case of a crash and see if it needs cleanup.
There can only be one cleanable operation at a time for a given resource.

To ensure that both scenarios mentioned above are taken care of, we will have
cleanup code in the cinder-volume and scheduler services.

Cinder-volume service cleanups will be similar to the ones we currently have
on startup -the ``init_host`` method- but with small modifications to use the
``workers`` table so services can tell which resources require cleanup because
they were left in the middle of an operation. With this we take care of one of
the scenarios, but we still have to consider the case where no replacement
volume service comes up with the same ``host`` configuration, and for that we
will add a mechanism on the scheduler that will take care of requesting
another volume service from the cluster, one that manages the same back-end,
to do the cleaning for the fallen service.
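
The modified start-up cleanup would only look at resources flagged in the
``workers`` table instead of every resource whose ``host`` matches the
service. A minimal sketch, assuming illustrative ``worker_get_all`` and
``worker_destroy`` DB helpers and a generic object loader::

    def init_host_cleanup(self, context):
        # Only resources flagged in the workers table need to be checked.
        workers = db.worker_get_all(context, service_id=self.service_id)
        for worker in workers:
            resource = objects.get_by_id(context, worker.resource_type,
                                         worker.resource_id)
            # Skip entries that no longer match the resource, for example
            # because a user issued a reset-state in the meantime.
            if resource.status != worker.status:
                continue
            if worker.status == 'downloading':
                # The download was interrupted and cannot be resumed.
                resource.status = 'error'
                resource.save()
            db.worker_destroy(context, id=worker.id)
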

The cleanup mechanism implemented on the scheduler will have manual and
automatic options: the manual option will require the caller to specify which
services should be cleaned up using filters, and the automatic option will
let the scheduler decide which services should be cleaned up based on their
status and how long they have been down.

The automatic cleanup mechanism will consist of a periodic task that samples
down services every ``service_down_time`` seconds and proceeds to clean up
resources left by a down service once ``auto_cleanup_checks`` x
``service_down_time`` seconds have passed since the service went down.
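
A rough sketch of that periodic task, using the usual oslo.service periodic
task machinery; ``_already_cleaned`` and ``_request_cleanup`` are illustrative
placeholders and the exact way down services are queried may differ::

    from oslo_config import cfg
    from oslo_service import periodic_task
    from oslo_utils import timeutils

    from cinder import db

    CONF = cfg.CONF


    class SchedulerCleanupTasks(periodic_task.PeriodicTasks):

        @periodic_task.periodic_task(spacing=CONF.service_down_time)
        def _cleanup_down_services(self, context):
            if not CONF.auto_cleanup_enabled:
                return
            limit = CONF.auto_cleanup_checks * CONF.service_down_time
            for service in db.service_get_all(context, is_up=False):
                down_for = timeutils.delta_seconds(service.updated_at,
                                                   timeutils.utcnow())
                # Only clean services that have been down long enough and
                # that no other scheduler has already cleaned.
                if down_for >= limit and not self._already_cleaned(service):
                    self._request_cleanup(context, service)
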

Since we can have multiple scheduler services and the cinder-volume service
all trying to do the cleanup simultaneously, the code needs to be able to
handle these situations.

On one hand, to prevent multiple schedulers from cleaning the same service's
resources, they will report to the other scheduler services all automatic
cleanup operations they request from the cinder-volume services, and on
startup they will ask the other scheduler services which services have
already been cleaned.

On the other hand, to prevent cleanup concurrency issues if a cleanup is
requested on a service that is already being cleaned up, we will issue all
cleanup operations with a timestamp indicating that only ``workers`` entries
older than that timestamp should be cleaned up, so when a service starts
doing the cleanup for a resource it updates the entry and prevents additional
cleanup operations on the resource.
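
Claiming an entry can be done with a single conditional ``UPDATE`` so that
concurrent cleanup requests cannot both win. A minimal sketch, as it could
live in ``cinder/db/sqlalchemy/api.py`` reusing the existing
``model_query``/``get_session`` helpers and assuming a new ``models.Worker``
model::

    def worker_claim_for_cleanup(context, worker, cleaner_id, until):
        """Try to claim a workers entry, return True on success."""
        session = get_session()
        # Only claim the entry if it is still older than the timestamp
        # carried by the cleanup request and nobody else claimed it since.
        count = (model_query(context, models.Worker, session=session)
                 .filter_by(id=worker.id, service_id=worker.service_id)
                 .filter(models.Worker.updated_at < until)
                 .update({'service_id': cleaner_id,
                          'updated_at': timeutils.utcnow()},
                         synchronize_session=False))
        return count == 1
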

Row deletion operations in the ``workers`` table will be real deletions in
the DB, not soft deletes like we do for other tables, because the number of
operations, and therefore of rows, will be quite high and because we will be
setting constraints on the rows that would not hold true if we had the same
resource multiple times (there are workarounds, but they don't seem to be
worth it).

Since these will be big, complex changes, we will not be enabling any kind of
automatic cleanup by default, and it will need to be either enabled in the
configuration using the ``auto_cleanup_enabled`` option or triggered using
the manual cleanup API -using filters- or the automatic cleanup API.

It will be possible to trigger the automatic cleanup mechanism via the API
even when it is disabled, as the disabling only prevents it from being
automatically triggered.

It is important to mention that using the "reset-state" operation on any
resource will remove any existing ``workers`` table entry in the DB.

When proceeding with a cleanup we will ensure that no other service is
working on that resource (by claiming the ``workers`` entry) and that the
data in the ``workers`` entry is still valid for the given resource (the
status matches), since a user may have forcefully issued another action on
the resource in the meantime.


Alternatives
------------

There are multiple alternatives to the proposed change, the most appealing
ones being:

- Use Tooz with a DLM that allows Leader Election to prevent more than one
  scheduler from doing cleanup of down services. Downsides to this solution
  are considerable:

  - Increased dependency on a DLM.

  - Limited DLM choices, since the DLM now needs to have Leader Election
    functionality.

  - We will still need to let other schedulers know when the leader does
    cleanups, because a newly elected leader will need this information to
    determine whether down services have already been cleaned.

- Create ``workers`` DB entries for every operation on a resource.
  Disadvantages of this alternative are:

  - Considerable performance impact.

  - Greatly increased cleanup mechanism complexity, as we would need to mark
    all entries as being processed by the service we are going to clean (this
    has its own complexity because multiple schedulers could be requesting
    it, or a scheduler and the service itself), then see which of those
    resources would require cleanup according to the ``workers`` table and
    check that no other service is already working on that resource because a
    user decided to do a cleanup on their own (for example a force delete on
    a deleting resource), and if there's no other service working on the
    resource and the resource has a status that is cleanable, then do the
    cleanup. Doing all this without races is quite complicated.

Data model impact
-----------------

Create a new ``workers`` table with the following fields:

- ``id``: To uniquely identify each entry and speed up some operations
- ``created_at``: To mark when the job was started at the API
- ``updated_at``: To mark when the job was last touched (API, SCH, VOL)
- ``deleted_at``: Will not be used
- ``resource_type``: Resource type (Volume, Backup, Snapshot...)
- ``resource_id``: UUID of the resource
- ``status``: The status that should be cleaned up on service failure
- ``service_id``: Service working on the resource
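
A minimal sketch of what the corresponding SQLAlchemy model could look like,
assuming the usual oslo.db base classes; the final model and migration may
differ::

    from oslo_db.sqlalchemy import models
    from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String,
                            UniqueConstraint)
    from sqlalchemy.ext.declarative import declarative_base

    BASE = declarative_base()


    class Worker(BASE, models.TimestampMixin, models.ModelBase):
        """Resource being worked on by a service that may need cleanup."""
        __tablename__ = 'workers'
        # A resource can only have one cleanable operation at a time.
        __table_args__ = (UniqueConstraint('resource_type', 'resource_id'),)

        id = Column(Integer, primary_key=True, autoincrement=True)
        deleted_at = Column(DateTime)  # Present for consistency, not used.
        resource_type = Column(String(40), nullable=False)
        resource_id = Column(String(36), nullable=False)
        status = Column(String(255), nullable=False)
        service_id = Column(Integer, ForeignKey('services.id'),
                            nullable=True)
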

REST API impact
---------------

Two new admin-only API endpoints will be created, ``/workers/cleanup`` and
``/workers/auto_cleanup``.

For the ``/workers/cleanup`` endpoint we will be able to supply filtering
parameters; if no arguments are provided the cleanup will issue a clean
message for all services that are down, but we can restrict which services we
want to be cleaned using the `service_id`, `cluster_name`, `host`, `binary`,
and `disabled` parameters.

Cleaning specific resources is also possible using the `resource_type` and
`resource_id` parameters.

Cleanup cannot be triggered during a cloud upgrade, but a restarted service
will still clean up its own resources during an upgrade.

Both API endpoints will return a dictionary with 2 lists, one with the
services that have been issued a cleanup request (`cleaning`) and another
with the services that cannot be cleaned right now because there is no
alternative service to do the cleanup in that cluster (`unavailable`), so
that the caller can know which services will be cleaned up.

Data returned for each service in the lists is the `id`, `name`, and `state`
fields.
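
As an illustration only -the URL prefix, authentication and exact payload
follow the usual cinder API conventions and may differ in the final
implementation- a manual cleanup request could look like this::

    import requests

    token = '<keystone token>'
    headers = {'X-Auth-Token': token, 'Content-Type': 'application/json'}
    # Ask for cleanup of down cinder-volume services in a given cluster.
    body = {'cluster_name': 'mycluster@lvmdriver-1',
            'binary': 'cinder-volume'}
    resp = requests.post('http://cinder-api:8776/v3/workers/cleanup',
                         json=body, headers=headers)
    # Expected shape of the reply:
    # {'cleaning': [{'id': 7, 'name': 'cinder-volume', 'state': 'down'}],
    #  'unavailable': [{'id': 9, 'name': 'cinder-volume', 'state': 'down'}]}
    print(resp.json())
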

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Small impact on cleanable operations, since we have to use the ``workers``
table to *flag* that we are working on the resource.

Other deployer impact
---------------------

None

Developer impact
----------------

Any developer that wants to add new resources requiring cleanup, or wants to
add cleanup for a status -new or existing- of an existing resource, will have
to use the new mechanism to mark the resource as cleanable, add which states
are cleanable, and add the cleanup code.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)
  Anyone is welcome to help

Work Items
----------

- Make DB changes to add the new ``workers`` table.

- Implement adding rows to the ``workers`` table.

- Change ``init_host`` to use an RPC call for the cleanup.

- Modify the Scheduler code to do cleanups.

- Create a devref explaining the requirements to add cleanup
  resources/statuses.


Dependencies
============

`Job Distribution`_:

- This depends on the job distribution mechanism so the cleanup can be done by
  any available service from the same cluster.

Testing
=======

Unit tests for the new cleanup behavior.


Documentation Impact
====================

Document the new configuration options ``auto_cleanup_enabled`` and
``auto_cleanup_checks``, as well as the cleanup mechanism.

Document the behavior of reset-state on an Active/Active deployment.


References
==========

General Description for HA A/A: https://review.openstack.org/232599

Job Distribution for HA A/A: https://review.openstack.org/327283

.. _Job Distribution: https://review.openstack.org/327283