Merge "Resource cleanup to support HA A/A"
commit 41b6bda218

specs/newton/ha-aa-cleanup.rst (new file, 341 lines)
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Cleanup
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can only run in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end, and we assume only one volume service
can access a specific storage back-end.

Given this premise, current code handles the cleanup for failed volume services
as if no other service were working with resources from its back-end, and that
is problematic when there are other volume services working with those
resources, as is the case in an Active/Active configuration.

This spec introduces a new cleanup mechanism and modifies the current cleanup
mechanism so proper cleanup is done regardless of the Cinder configuration,
Active/Passive or Active/Active.


Problem description
===================

Current Cinder code only supports Active/Passive configurations, so the cleanup
takes that into account and cleans up resources from ongoing operations
accordingly, but that is incompatible with an Active/Active deployment.

The incompatibility comes from the fact that volume services on startup look in
the DB for resources that are in the middle of an operation and are from their
own storage back-end - detected by the ``host`` field - and proceed to clean
them up depending on the state they are in. For example, a ``downloading``
volume will be changed to ``error`` since the download was interrupted and we
cannot recover from it.

With the new job distribution mechanism the ``host`` field will contain the
host configuration of the volume service that created the resource, but that
resource may now be in use by another volume service from the same cluster, so
we cannot just rely on this ``host`` field for cleanup, as it may lead to
cleaning the wrong resources or skipping the ones we should be cleaning.

When we are working with an Active/Active system we cannot just clean all
resources from our storage back-end that are in an ongoing state, since they
may be legitimate ongoing jobs being handled by other volume services.

We are going to forget for a moment how we are doing the cleanup right now and
focus on the different cleanup scenarios we have to cover. One is when a
volume service "dies" -by that we mean that it really stops working, or it is
fenced- and failover boots another volume service to replace it as if it were
the same service -having the same ``host`` and ``cluster`` configurations-.
The other scenario is when the service dies and no other service takes its
place, or the service that takes its place shares the ``cluster`` configuration
but has a different ``host``.

Those are the cases we have to solve to be able to support Active/Active and
Active/Passive configurations with proper cleanups.


Use Cases
=========

Operators that have hard requirements -SLAs or other reasons- to keep their
cloud operational at all times, or that have higher throughput requirements,
will want the possibility to configure their deployments in an Active/Active
configuration and have proper cleanup of resources when services die.


Proposed change
===============

Since checking the status and the ``host`` field of the resource is no
longer enough to know if it needs cleanup -because the ``host`` field will be
referring to the ``host`` configuration of the volume service that created the
resource and not the owner of the resource, as explained in the `Job
Distribution`_ spec- we will create a new table to track which service from
the cluster is working on each resource.

We'll call this new table ``workers`` and it will include all resources that
are being processed by cleanable operations, and therefore would require
cleanup if the service that is doing the operation crashed.

When a cleanable job is requested by the API or any of the services -for
example, a volume deletion can be requested by the API or by the c-vol service
during a migration- we will create a new row in the ``workers`` table with the
resource we are working on and who is working on it. And once the operation
has been completed -successfully or unsuccessfully- this row will be deleted to
indicate processing has concluded and a cleanup will no longer be needed if the
service dies.

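As a minimal sketch of how an operation could flag and unflag itself, assuming
hypothetical ``worker_create``/``worker_destroy`` DB API helpers (the names are
illustrative, not a committed API)::

    import contextlib

    from cinder import db   # assumed to expose the new ``workers`` helpers


    @contextlib.contextmanager
    def cleanable_operation(context, resource, status, service_id):
        """Flag a resource as being worked on while an operation runs.

        The row is removed on completion, successful or not, so only a
        service crash leaves it behind for the cleanup mechanism.
        """
        # Hypothetical helper inserting the row described in the data model.
        worker = db.worker_create(context,
                                  resource_type=type(resource).__name__,
                                  resource_id=resource.id,
                                  status=status,
                                  service_id=service_id)
        try:
            yield worker
        finally:
            db.worker_destroy(context, id=worker.id)

A volume deletion, for example, would wrap its work in
``with cleanable_operation(ctxt, volume, 'deleting', service.id):`` so that
only a crash leaves the flag behind.
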
We will not be adding a row for non-cleanable operations or for resources that
are used in cleanable operations but won't require cleanup, as this would
create a significant increase in DB operations that would end up affecting the
performance of all operations.

These ``workers`` rows serve as *flags* for the cleanup mechanism to know it
must check that resource in case of a crash and see if it needs cleanup. There
can only be one cleanable operation at a time for a given resource.

To ensure that both scenarios mentioned above are taken care of, we will have
cleanup code on the cinder-volume and Scheduler services.

Cinder-volume service cleanups will be similar to the ones we currently have on
startup -the ``init_host`` method- but with small modifications to use the
``workers`` table so services can tell which resources require cleanup because
they were left in the middle of an operation. With this we take care of one of
the scenarios, but we still have to consider the case where no replacement
volume service comes up with the same ``host`` configuration, and for that we
will add a mechanism on the scheduler that will take care of requesting another
volume service from the cluster, one that manages the same back-end, to do the
cleaning for the fallen service.

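A rough sketch of what such an ``init_host`` cleanup could look like when it is
driven by ``workers`` entries instead of a ``host``-wide scan; the
``worker_get_all`` and ``worker_destroy`` helpers are assumptions, not the
final API::

    from cinder import db        # assumed to expose the new ``workers`` helpers
    from cinder import objects


    def init_host_cleanup(context, service_id):
        """Clean up resources the given service left mid-operation."""
        for worker in db.worker_get_all(context, service_id=service_id):
            # Load the flagged resource through its versioned object class,
            # e.g. objects.Volume when ``resource_type`` is 'Volume'.
            resource = getattr(objects, worker.resource_type).get_by_id(
                context, worker.resource_id)
            # Drop stale entries, e.g. after a user reset-state.
            if resource.status != worker.status:
                db.worker_destroy(context, id=worker.id)
                continue
            # Real cleanup depends on the status; as in the example above,
            # an interrupted 'downloading' volume just goes to 'error'.
            if worker.status == 'downloading':
                resource.status = 'error'
                resource.save()
            db.worker_destroy(context, id=worker.id)
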
The cleanup mechanism implemented on the scheduler will have manual and
automatic options. The manual option will require the caller to specify which
services should be cleaned up using filters, while the automatic operation will
let the scheduler decide which services should be cleaned up based on their
status and how long they have been down.

The automatic cleanup mechanism will consist of a periodic task that will
sample services that are down, with a frequency of ``service_down_time``
seconds, and will proceed to clean up resources that were left by those
services once ``auto_cleanup_checks`` x ``service_down_time`` seconds have
passed since the service went down.

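The timing rule can be read as: a service is only auto-cleaned once it has been
unheard of for ``auto_cleanup_checks`` x ``service_down_time`` seconds. A small
sketch of that check, assuming the service record carries its usual
``updated_at`` heartbeat timestamp::

    from oslo_utils import timeutils


    def needs_auto_cleanup(service, service_down_time, auto_cleanup_checks):
        """Return True when a down service qualifies for automatic cleanup."""
        elapsed = timeutils.delta_seconds(service.updated_at,
                                          timeutils.utcnow())
        if elapsed < service_down_time:
            return False            # the service is not even considered down
        # e.g. service_down_time=60 and auto_cleanup_checks=10 means the
        # service must have been silent for at least 600 seconds.
        return elapsed >= auto_cleanup_checks * service_down_time
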
Since we can have multiple Scheduler services and the cinder-volume service all
trying to do the cleanup simultaneously, the code needs to be able to handle
these situations.

On one hand, to prevent multiple Schedulers from cleaning the same service's
resources, they will report all automatic cleanup operations requested to the
cinder-volume services to the other Scheduler services, and on service start
they will ask the other Scheduler services which services have already been
cleaned.

On the other hand, to prevent cleanup concurrency issues if a cleanup is
requested on a service that is already being cleaned up, we will issue all
cleanup operations with a timestamp indicating that only ``workers`` entries
from before that time should be cleaned up, so when a service starts doing the
cleanup for a resource it updates the entry and prevents additional cleanup
operations on the resource.

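One way to picture the claim is a conditional update on the ``workers`` row
that only succeeds if the entry has not been touched since the timestamp
carried by the cleanup request; a later cleanup request then simply finds
nothing left to claim. A sketch using the SQLAlchemy query API, where the
``Worker`` model is the one this spec adds::

    from oslo_utils import timeutils

    from cinder.db.sqlalchemy import models   # assumed home of ``Worker``


    def claim_worker(session, worker_id, request_time, service_id):
        """Atomically claim a ``workers`` entry for cleanup."""
        count = (session.query(models.Worker).
                 filter(models.Worker.id == worker_id,
                        models.Worker.updated_at <= request_time).
                 update({'service_id': service_id,
                         'updated_at': timeutils.utcnow()},
                        synchronize_session=False))
        # Zero rows updated means another service already claimed the entry
        # (or finished and deleted it), so this cleanup skips the resource.
        return count > 0
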
Row deletions in the ``workers`` table will be real deletions in the DB, not
soft deletes like we do for other tables, because the number of operations, and
therefore of rows, will be quite high, and because we will be setting
constraints on the rows that would not hold true if we had the same resource
multiple times (there are workarounds, but it doesn't seem to be worth it).

Since these will be big, complex changes, we will not be enabling any kind of
automatic cleanup by default, and it will need to be either enabled in the
configuration using the ``auto_cleanup_enabled`` option or triggered using the
manual cleanup API -using filters- or the automatic cleanup API.

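The two options could be regular oslo.config settings; only the option names
come from this spec, the defaults and help strings below are illustrative::

    from oslo_config import cfg

    cleanup_opts = [
        cfg.BoolOpt('auto_cleanup_enabled',
                    default=False,
                    help='Enable periodic automatic cleanup of resources '
                         'left behind by volume services that are down.'),
        cfg.IntOpt('auto_cleanup_checks',
                   default=10,
                   help='Number of service_down_time periods a service must '
                        'have been down before it is automatically cleaned.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(cleanup_opts)
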
It will be possible to trigger the automatic cleanup mechanism via the API even
when it is disabled, as the disabling only prevents it from being automatically
triggered.

It is important to mention that using the "reset-state" operation on any
resource will remove any existing ``workers`` table entry for it in the DB.

When proceeding with a cleanup we will ensure that no other service is working
on that resource (by claiming the ``workers`` entry) and that the data on the
``workers`` entry is still valid for the given resource (its status matches),
since a user may have forcefully issued another action on the resource in the
meantime.


Alternatives
------------

There are multiple alternatives to the proposed change; the most appealing ones
are:

- Use Tooz with a DLM that allows Leader Election to prevent more than one
  scheduler from doing cleanup of down services. Downsides to this solution
  are considerable:

  - Increased dependency on a DLM.

  - Limiting DLM choices, since it now needs to have Leader Election
    functionality.

  - We will still need to let other schedulers know when the leader does
    cleanups, because a newly elected leader will need this information to
    determine whether down services have already been cleaned.

- Create ``workers`` DB entries for every operation on a resource.
  Disadvantages of this alternative are:

  - Considerable performance impact.

  - Greatly increased cleanup mechanism complexity, as we would need to mark
    all entries as being processed by the service we are going to clean (this
    has its own complexity because multiple schedulers could be requesting it,
    or a scheduler and the service itself), then see which of those resources
    would require cleanup according to the ``workers`` table and check that no
    other service is already working on that resource because a user decided
    to do a cleanup on their own (for example a force delete on a deleting
    resource), and if there's no other service working on the resource and the
    resource has a status that is cleanable, then do the cleanup. Doing all
    this without races is quite complicated.

Data model impact
-----------------

Create a new ``workers`` table with the following fields (a sketch of a
possible model follows the list):

- ``id``: To uniquely identify each entry and speed up some operations
- ``created_at``: To mark when the job was started at the API
- ``updated_at``: To mark when the job was last touched (API, SCH, VOL)
- ``deleted_at``: Will not be used
- ``resource_type``: Resource type (Volume, Backup, Snapshot...)
- ``resource_id``: UUID of the resource
- ``status``: The status that should be cleaned on service failure
- ``service_id``: Service working on the resource

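A possible SQLAlchemy model for the table above; the column sizes, the unique
constraint and the foreign key to ``services`` are assumptions based on the
text, not a final definition::

    from oslo_db.sqlalchemy import models
    from sqlalchemy import Column, ForeignKey, Integer, String
    from sqlalchemy import UniqueConstraint
    from sqlalchemy.ext.declarative import declarative_base

    BASE = declarative_base()


    class Worker(BASE, models.TimestampMixin, models.ModelBase):
        """Resource currently undergoing a cleanable operation."""

        __tablename__ = 'workers'
        __table_args__ = (
            # Only one cleanable operation per resource at a time.
            UniqueConstraint('resource_type', 'resource_id'),
        )

        # created_at/updated_at come from TimestampMixin; deleted_at is unused.
        id = Column(Integer, primary_key=True, autoincrement=True)
        resource_type = Column(String(40), nullable=False)
        resource_id = Column(String(36), nullable=False)
        status = Column(String(255), nullable=False)
        service_id = Column(Integer, ForeignKey('services.id'), nullable=True)
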
REST API impact
---------------

Two new admin-only API endpoints will be created, ``/workers/cleanup`` and
``/workers/auto_cleanup``.

The ``/workers/cleanup`` endpoint accepts filtering parameters; if no arguments
are provided, cleanup will issue a clean message for all services that are
down. We can restrict which services we want to be cleaned using the
parameters `service_id`, `cluster_name`, `host`, `binary`, and `disabled`.

Cleaning specific resources is also possible using the `resource_type` and
`resource_id` parameters.

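For illustration only, a filtered manual cleanup request could be issued like
this; the URL, port, token handling and filter values are made up, only the
endpoint path and parameter names come from this spec::

    import os

    import requests

    resp = requests.post(
        'http://controller:8776/v3/workers/cleanup',       # illustrative URL
        headers={'X-Auth-Token': os.environ['OS_TOKEN']},   # admin credentials
        json={'cluster_name': 'mycluster@lvmdriver',        # restrict the cleanup,
              'binary': 'cinder-volume'},                   # or send {} for all
    )
    resp.raise_for_status()
    print(resp.json())
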
Cleanup cannot be triggered during a cloud upgrade, but a restarted service
will still clean up its own resources during an upgrade.

Both API endpoints will return a dictionary with two lists: one with services
that have been issued a cleanup request (`cleaning`) and another with services
that cannot be cleaned right now because there is no alternative service to do
the cleanup in that cluster (`unavailable`). That way the caller can know which
services will be cleaned up.

Data returned for each service in the lists are the `id`, `name`, and `state`
fields.

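An illustrative response body (all values made up)::

    {
        "cleaning": [
            {"id": 7, "name": "cinder-volume", "state": "down"}
        ],
        "unavailable": [
            {"id": 9, "name": "cinder-volume", "state": "down"}
        ]
    }
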
Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Small impact on cleanable operations, since we have to use the ``workers``
table to *flag* that we are working on the resource.

Other deployer impact
---------------------

None

Developer impact
----------------

Any developer that wants to add new resources requiring cleanup, or wants to
add cleanup for a status -new or existing- of an existing resource, will have
to use the new mechanism to mark the resource as cleanable, add which states
are cleanable, and add the cleanup code.

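Purely as an illustration of that workflow, a developer-facing registration
could look like the following; the decorator and registry here are invented
for this sketch and are not part of the spec::

    # Hypothetical registry mapping (resource type, status) to a cleanup handler.
    CLEANABLE = {}


    def cleanable(resource_type, status):
        """Register a cleanup handler for ``resource_type`` in ``status``."""
        def decorator(func):
            CLEANABLE[(resource_type, status)] = func
            return func
        return decorator


    @cleanable('Volume', 'downloading')
    def cleanup_downloading(context, volume):
        # An interrupted download cannot be resumed, so the volume goes to
        # error, matching the behavior described in the problem description.
        volume.status = 'error'
        volume.save()
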
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)
  Anyone is welcome to help

Work Items
----------

- Make DB changes to add the new ``workers`` table.

- Implement adding rows to the ``workers`` table.

- Change ``init_host`` to use an RPC call for the cleanup.

- Modify Scheduler code to do cleanups.

- Create a devref explaining the requirements to add cleanup
  resources/statuses.


Dependencies
============

`Job Distribution`_:

- This depends on the job distribution mechanism so the cleanup can be done by
  any available service from the same cluster.

Testing
=======

Unit tests for the new cleanup behavior.


Documentation Impact
====================

Document the new configuration options ``auto_cleanup_enabled`` and
``auto_cleanup_checks``, as well as the cleanup mechanism.

Document the behavior of reset-state on Active/Active deployments.


References
==========

General Description for HA A/A: https://review.openstack.org/232599

Job Distribution for HA A/A: https://review.openstack.org/327283

.. _Job Distribution: https://review.openstack.org/327283