Cinder Volume Active/Active support - Cleanup
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
Right now the cinder-volume service can only run in an Active/Passive HA fashion.
One of the reasons for this is that we have no concept of a cluster of nodes that handle the same storage back-end, and we assume only one volume service can access a specific storage back-end.
Given this premise, current code handles the cleanup for failed volume services as if no other service were working with resources from their back-end, and that is problematic when there are other volume services working with those resources, as is the case in an Active/Active configuration.
This spec introduces a new cleanup mechanism and modifies the current cleanup mechanism so that proper cleanup is done regardless of the Cinder configuration, Active/Passive or Active/Active.
Problem description
Current Cinder code only supports Active/Passive configurations, so the cleanup takes that into account and cleans up resources from ongoing operations accordingly, but that is incompatible with an Active/Active deployment.
The incompatibility comes from the fact that on startup volume services look in the DB for resources that are in the middle of an operation and belong to their own storage back-end -detected by the host field- and proceed to clean them up depending on the state they are in. For example, a downloading volume will be changed to error, since the download was interrupted and we cannot recover from it.
With the new job distribution mechanism the host field will contain the host configuration of the volume service that created the resource, but that resource may now be in use by another volume service from the same cluster, so we cannot rely on this host field alone for cleanup, as it may lead to cleaning the wrong resources or skipping the ones we should be cleaning.
When we are working with an Active/Active system we cannot just clean all resources from our storage back-end that are in an ongoing state, since they may be legitimate ongoing jobs being handled by other volume services.
We are going to forget for a moment how we are doing the cleanup right now and focus on the different cleanup scenarios we have to cover. One is when a volume service "dies" -by that we mean that it really stops working, or it is fenced- and failover boots another volume service to replace it as if it were the same service -having the same host and cluster configurations-. The other scenario is when the service dies and no other service takes its place, or the service that takes its place shares the cluster configuration but has a different host.
Those are the cases we have to solve to be able to support Active/Active and Active/Passive configurations with proper cleanups.
Use Cases
Operators that have hard requirements -SLAs or other reasons- to keep their cloud operational at all times, or that have higher throughput requirements, will want the possibility to configure their deployments in an Active/Active fashion and have proper cleanup of resources when services die.
Proposed change
Since checking the status and the host field of the resource is no longer enough to know if it needs cleanup -because the host field will be referring to the host configuration of the volume service that created the resource and not the owner of the resource, as explained in the Job Distribution spec- we will create a new table to track which service from the cluster is working on each resource.
We'll call this new table workers, and it will include all resources that are being processed by cleanable operations and therefore would require cleanup if the service doing the operation crashed.
When a cleanable job is requested by the API or any of the services -for example a volume deletion can be requested by the API or by the c-vol service during a migration- we will create a new row in the workers table with the resource we are working on and who is working on it. Once the operation has been completed -successfully or unsuccessfully- this row will be deleted to indicate that processing has concluded and a cleanup will no longer be needed if the service dies.
We will not be adding a row for non-cleanable operations, or for resources that are used in cleanable operations but won't require cleanup, as this would create a significant increase in DB operations that would end up affecting the performance of all operations.
These workers rows serve as flags for the cleanup mechanism, telling it that it must check that resource in case of a crash and see if it needs cleanup. Only one cleanable operation can exist at a time for a given resource.
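To make the intended lifecycle concrete, here is a minimal, self-contained sketch of how a cleanable operation could be bracketed by a workers entry; the names and the in-memory dictionary are illustrative only, the real implementation will use the new DB table:

    from contextlib import contextmanager
    from datetime import datetime, timezone

    # Stand-in for the workers table: at most one entry per resource.
    _workers = {}

    @contextmanager
    def cleanable_operation(resource_id, resource_type, status, service_id):
        if resource_id in _workers:
            raise RuntimeError('resource already has a cleanable operation')
        # Flag the resource so that a crash mid-operation is detectable later.
        _workers[resource_id] = {'resource_type': resource_type,
                                 'status': status,
                                 'service_id': service_id,
                                 'updated_at': datetime.now(timezone.utc)}
        try:
            yield
        finally:
            # Remove the flag on completion -successful or not-; a row that
            # remains means the service died and the resource may need cleanup.
            _workers.pop(resource_id, None)

    # Usage: wrap a cleanable job such as an image download into a volume.
    with cleanable_operation('volume-uuid', 'Volume', 'downloading', 1):
        pass  # ... do the actual work ...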
To ensure that both scenarios mentioned above are taken care of, we will have cleanup code on cinder-volume and Scheduler services.
Cinder-volume service cleanups will be similar to the ones we currently have on startup -the init_host method- but with small modifications to use the workers table so services can tell which resources require cleanup because they were left in the middle of an operation. With this we take care of one of the scenarios, but we still have to consider the case where no replacement volume service comes up with the same host configuration, and for that we will add a mechanism on the scheduler that will take care of requesting another volume service from the cluster -one that manages the same back-end- to do the cleanup for the fallen service.
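A rough, self-contained sketch of what the modified startup cleanup could look like; here workers stands in for rows of the new table and cleanup_resource for the per-status cleanup logic Cinder already has, so all names are illustrative:

    def init_host_cleanup(my_service_id, workers):
        """Clean up resources this service was working on before it died."""
        for worker in workers:
            if worker['service_id'] != my_service_id:
                # Another service in the cluster owns this job; leave it alone.
                continue
            cleanup_resource(worker)
            # The workers row would be deleted once the resource is dealt with.

    def cleanup_resource(worker):
        # e.g. a 'downloading' volume would be moved to 'error'.
        print('cleaning %(resource_type)s %(resource_id)s left in %(status)s'
              % worker)

    init_host_cleanup(
        my_service_id=7,
        workers=[{'service_id': 7, 'resource_type': 'Volume',
                  'resource_id': 'uuid-1', 'status': 'downloading'}])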
The cleanup mechanism implemented on the scheduler will have manual and automatic options. The manual option will require the caller to specify, using filters, which services should be cleaned up, while the automatic option will let the scheduler decide which services should be cleaned up based on their status and how long they have been down.
The automatic cleanup mechanism will consist of a periodic task that samples services that are down, with a frequency of service_down_time seconds, and proceeds to clean up resources left behind by a down service once auto_cleanup_checks x service_down_time seconds have passed since the service went down.
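The timing rule can be summarized with the following sketch; the option names follow this spec, while the defaults and the service records are assumptions made for illustration:

    from datetime import datetime, timedelta, timezone

    SERVICE_DOWN_TIME = 60      # existing option, in seconds
    AUTO_CLEANUP_CHECKS = 10    # new option proposed by this spec

    def services_to_auto_cleanup(down_services, now=None):
        """Return the down services that are eligible for automatic cleanup."""
        now = now or datetime.now(timezone.utc)
        threshold = timedelta(seconds=AUTO_CLEANUP_CHECKS * SERVICE_DOWN_TIME)
        return [svc for svc in down_services
                if now - svc['last_heartbeat'] > threshold]

    # The periodic task would call this every SERVICE_DOWN_TIME seconds and
    # request cleanup for whatever it returns.
    down = [{'id': 3, 'last_heartbeat': datetime.now(timezone.utc)
             - timedelta(seconds=900)}]
    print(services_to_auto_cleanup(down))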
Since we can have multiple scheduler services and the cinder-volume service all trying to do the cleanup simultaneously, the code needs to be able to handle these situations.
On one hand, to prevent multiple schedulers from cleaning the same service's resources, each scheduler will report to the other scheduler services all automatic cleanup operations it requests from the cinder-volume services, and on start it will ask the other scheduler services which services have already been cleaned.
On the other hand, to prevent concurrency issues when a cleanup is requested for a service that is already being cleaned up, we will issue all cleanup operations with a timestamp indicating that only workers entries older than that should be cleaned up; when a service starts doing the cleanup for a resource it updates the entry and thereby prevents additional cleanup operations on the resource.
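The claiming rule could look roughly like the following self-contained sketch; in the real implementation this would be a single conditional UPDATE on the workers row so that only one of the competing services succeeds:

    from datetime import datetime, timezone

    def try_claim(worker, request_time, my_service_id):
        """Return True if this service now owns the cleanup of the worker row."""
        if worker['updated_at'] >= request_time:
            # The entry was touched after the cleanup request was issued,
            # so another service is already dealing with it.
            return False
        worker['updated_at'] = datetime.now(timezone.utc)
        worker['service_id'] = my_service_id
        return True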
Row deletions in the workers table will be real deletions in the DB, not soft deletes like we do for other tables, because the number of operations, and therefore of rows, will be quite high, and because we will be setting constraints on the rows that would not hold true if we had the same resource multiple times (there are workarounds, but they don't seem to be worth it).
Since these will be big, complex changes, we will not be enabling any kind of automatic cleanup by default; it will need to be either enabled in the configuration using the auto_cleanup_enabled option or triggered using the manual cleanup API -using filters- or the automatic cleanup API.
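As an illustration, the new options could be registered with oslo.config along these lines; the option names come from this spec, while the defaults and help strings are assumptions:

    from oslo_config import cfg

    cleanup_opts = [
        cfg.BoolOpt('auto_cleanup_enabled',
                    default=False,
                    help='Automatically clean up resources left behind by '
                         'services that are down.'),
        cfg.IntOpt('auto_cleanup_checks',
                   default=10,
                   help='Number of service_down_time periods a service must '
                        'have been down before its resources are cleaned up '
                        'automatically.'),
    ]

    cfg.CONF.register_opts(cleanup_opts)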
It will be possible to trigger the automatic cleanup mechanism via the API even when it is disabled, as the disabling only prevents it from being automatically triggered.
It is important to mention that using the "reset-state" operation on any resource will remove any existing workers table entry for it in the DB.
When proceeding with a cleanup we will ensure that no other service is working on that resource (claiming the workers entry) and that the data on the workers entry is still valid for the given resource (the status matches), since a user may have forcefully issued another action on the resource in the meantime.
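A minimal sketch of those two checks, with resource and worker standing in for the DB records:

    def should_clean(resource, worker, claimed):
        # 1. We must have claimed the workers entry, i.e. no other service is
        #    already cleaning this resource.
        if not claimed:
            return False
        # 2. The entry must still describe reality: if an admin reset the state
        #    or forced another operation, the stale entry must be ignored.
        return resource['status'] == worker['status']

    print(should_clean({'status': 'downloading'},
                       {'status': 'downloading'}, claimed=True))   # True
    print(should_clean({'status': 'available'},
                       {'status': 'downloading'}, claimed=True))   # False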
Alternatives
There are multiple alternatives to the proposed change; the most appealing ones are:
- Use Tooz with a DLM that allows Leader Election to prevent more than
one scheduler from doing cleanup of down services. Downsides to this
solution are considerable:
- Increased dependency on a DLM.
- Limiting DLM choices since now it needs to have Leader Election functionality.
- We will still need to let other schedulers know when the leader does cleanups, because a newly elected leader will need this information to determine whether down services have already been cleaned.
- Create workers DB entries for every operation on a resource. Disadvantages of this alternative are:
- Considerable performance impact.
- Greatly increased cleanup mechanism complexity, as we would need to mark all entries as being processed by the service we are going to clean (this has its own complexity because multiple schedulers could be requesting it, or a scheduler and the service itself), then see which of those resources would require cleanup according to the workers table, and check that no other service is already working on that resource because a user decided to do a cleanup on their own (for example a force delete on a deleting resource); only if no other service is working on the resource and the resource has a cleanable status do we do the cleanup. Doing all this without races is quite complicated.
Data model impact
Create a new workers table with the following fields:

- id: To uniquely identify each entry and speed up some operations
- created_at: To mark when the job was started at the API
- updated_at: To mark when the job was last touched (API, SCH, VOL)
- deleted_at: Will not be used
- resource_type: Resource type (Volume, Backup, Snapshot...)
- resource_id: UUID of the resource
- status: The status that should be cleaned on service failure
- service_id: Service working on the resource
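A possible SQLAlchemy model following the fields above; column types and the unique constraint are assumptions, not the final migration:

    from sqlalchemy import Column, DateTime, Integer, String, UniqueConstraint
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Worker(Base):
        __tablename__ = 'workers'
        # Only one cleanable operation per resource at a time.
        __table_args__ = (UniqueConstraint('resource_type', 'resource_id'),)

        id = Column(Integer, primary_key=True)
        created_at = Column(DateTime)       # when the job was started at the API
        updated_at = Column(DateTime)       # last touched (API, SCH, VOL)
        deleted_at = Column(DateTime)       # present for consistency, not used
        resource_type = Column(String(36))  # Volume, Backup, Snapshot...
        resource_id = Column(String(36))    # UUID of the resource
        status = Column(String(255))        # status to clean on service failure
        service_id = Column(Integer)        # service working on the resource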
REST API impact
Two new admin-only API endpoints will be created, /workers/cleanup and /workers/auto_cleanup.
For the /workers/cleanup endpoint we will be able to supply filtering parameters; if no arguments are provided, the cleanup will issue a clean message for all services that are down. We can restrict which services we want cleaned using the service_id, cluster_name, host, binary, and disabled parameters.
Cleaning specific resources is also possible using the resource_type and resource_id parameters.
Cleanup cannot be triggered during a cloud upgrade, but a restarted service will still clean up its own resources during an upgrade.
Both API endpoints will return a dictionary with 2 lists, one with services that have been issued a cleanup request (cleaning) and another list with services that cannot be cleaned right now because there is no alternative service to do the cleanup in that cluster (unavailable), that way the caller can know which services will be cleaned up.
The data returned for each service in the lists are the id, name, and state fields.
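As an illustration only, the response body of a /workers/cleanup request could look like the following; all values are made up, only the structure follows this spec:

    example_response = {
        'cleaning': [
            # services that have been sent a cleanup request
            {'id': 3, 'name': 'cinder-volume', 'state': 'down'},
        ],
        'unavailable': [
            # services that cannot be cleaned right now because no alternative
            # service in their cluster can do the cleanup for them
            {'id': 5, 'name': 'cinder-volume', 'state': 'down'},
        ],
    }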
Security impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
Small impact on cleanable operations, since we have to use the workers table to flag that we are working on the resource.
Other deployer impact
None
Developer impact
Any developer that wants to add new resources requiring cleanup, or wants to add cleanup for a status -new or existing- of an existing resource, will have to use the new mechanism to mark the resource as cleanable, add which states are cleanable, and add the cleanup code.
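A hypothetical illustration of what that developer workflow could boil down to; the names and data structures here are made up for the example, not the final mechanism:

    # Which statuses of each resource type are cleanable, and what to do.
    CLEANABLE_STATUSES = {
        'Volume': {'creating', 'downloading', 'deleting'},
    }
    CLEANUP_ACTIONS = {
        ('Volume', 'downloading'): lambda vol: vol.update(status='error'),
    }

    def register_cleanable(resource_type, status, action):
        """Adding cleanup for a new resource/status registers both pieces."""
        CLEANABLE_STATUSES.setdefault(resource_type, set()).add(status)
        CLEANUP_ACTIONS[(resource_type, status)] = action

    # Example: a developer adding cleanup for snapshots stuck in 'creating'.
    register_cleanable('Snapshot', 'creating',
                       lambda snap: snap.update(status='error'))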
Implementation
Assignee(s)
- Primary assignee: Gorka Eguileor (geguileo)
- Other contributors: Michal Dulko (dulek). Anyone is welcome to help.
Work Items
- Make DB changes to add the new workers table.
- Implement adding rows to the workers table.
- Change init_host to use an RPC call for the cleanup.
- Modify Scheduler code to do cleanups.
- Create devref explaining requirements to add cleanup resources/statuses.
Dependencies
- Job Distribution: This spec depends on the job distribution mechanism so the cleanup can be done by any available service from the same cluster.
Testing
Unit tests for the new cleanup behavior.
Documentation Impact
Document the new configuration options auto_cleanup_enabled and auto_cleanup_checks as well as the cleanup mechanism.
Document behavior of reset-state on Active-Active deployment.
References
General Description for HA A/A: https://review.openstack.org/232599
Job Distribution for HA A/A: https://review.openstack.org/327283