Temporary Resource Tracking

This spec proposes a way to easily track which volumes and snapshots are
temporary and should not be considered when updating the usage quota.

Change-Id: I0a3f5836641dec535c2d2bf49cbf3a435faa8224
Implements: blueprint temp-resources
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

===========================
Temporary Resource Tracking
===========================

https://blueprints.launchpad.net/cinder/+spec/temp-resources

Improve Cinder's temporary resource tracking to prevent related quota issues.

Problem description
===================

Cinder doesn't currently have a consistent way of tracking temporary resources,
which leads to quota bugs.

In some cases temporary volumes are marked with the ``temporary`` key in the
admin metadata table, in other cases we determine that a volume is temporary
based on its ``migration_status`` field, and in some cases volumes are not
marked as temporary at all.  Having multiple, roundabout ways of marking
temporary volumes makes the Cinder code error prone, as the number of bugs in
this area shows.

As for temporary snapshots, Cinder doesn't currently have any way of reliably
tracking them, so the code creating temporary resources assumes that everything
will run smoothly and that the deletion code in the same method will be called
after the operation completes successfully.  That is not always true: the
operation can fail and leave the temporary resource behind, forcing users to
delete it manually, which throws the quota out of sync, since the REST API
delete call doesn't know it shouldn't touch the quota.

When we say that we don't have a reliable way of tracking snapshots we mean
that, even though temporary snapshots get names that help identify them, such
as ``[revert] volume %s backup snapshot`` and ``backup-snap-%s``, these are
also valid names that a user can assign to a normal snapshot, so we cannot rely
on them to differentiate temporary snapshots.

Use Cases
=========

There are several cases where this feature will be useful:

* Revert to snapshot is configured to use a temporary snapshot, but either the
  revert fails or the deletion of the temporary snapshot fails, so the user
  ends up deleting the snapshot manually and wants the quota to be kept in
  sync with reality.
* Creating a backup of an in-use volume when ``backup_use_temp_snapshot`` is
  enabled fails, or the deletion of the temporary resource fails, forcing the
  user to delete the snapshot manually, and the user wants the quota to be
  kept in sync with reality.
* A driver may have slow code that, for performance reasons, is triggered when
  cloning a volume or creating a snapshot, but that is not worth executing for
  temporary volumes.  An example is the flattening of cloned volumes in the
  RBD driver (see the sketch after this list).
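
A minimal sketch, assuming the ``use_quota`` field proposed below, of how a
driver could skip such an optimization for temporary volumes.  The method and
helper names here are illustrative, this is not the actual RBD driver code:

.. code-block:: python

   def create_cloned_volume(self, volume, src_vref):
       """Clone a volume, only flattening long-lived ones."""
       self._do_clone(volume, src_vref)   # hypothetical backend helper

       # Temporary volumes are short lived, so the expensive flatten
       # operation is not worth running for them.
       if volume.use_quota:
           self._flatten(volume)          # hypothetical backend helper
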
Proposed change
===============

The proposed solution is to have an explicit DB field that indicates whether a
resource should be counted towards quota or not.

The field would be named ``use_quota`` and it would be added to the ``volumes``
and ``snapshots`` DB tables.  We currently don't have temporary backups, so no
field would be added to the ``backups`` DB table.

This would replace the ``temporary`` admin metadata entry and the
``migration_status`` based check over two release cycles, since we need to
keep supporting rolling upgrades where we could be running code that doesn't
know about the new ``use_quota`` field.
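
As an illustration of the intent, and not of the actual Cinder implementation,
delete code could decide whether to touch the quota with a simple check on the
new field (the helper name is made up for this sketch):

.. code-block:: python

   from cinder import quota

   QUOTAS = quota.QUOTAS


   def _release_volume_quota(context, volume):
       """Return quota on deletion, only for resources that consumed it."""
       if not volume.use_quota:
           # Temporary resource: it never consumed quota, nothing to return.
           return

       reservations = QUOTAS.reserve(context,
                                     project_id=volume.project_id,
                                     volumes=-1,
                                     gigabytes=-volume.size)
       QUOTAS.commit(context, reservations, project_id=volume.project_id)
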
Alternatives
------------

An alternative solution would be to use the ``temporary`` key in the volumes'
admin metadata table, like we are already doing in some cases, and create an
equivalent table for snapshots.

With that alternative the DB queries would become more complex, unlike with
the proposed solution, where they become simpler.

Data model impact
-----------------

Adds a ``use_quota`` DB field of type ``Boolean`` to both the ``volumes`` and
``snapshots`` tables.

There will be an online data migration to set the ``use_quota`` field for
existing volumes, as well as an updated ``save`` method on the ``Volume`` and
``Snapshot`` OVOs that sets this field whenever they are saved.
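
A rough sketch of the schema change, assuming a sqlalchemy-migrate style script
like Cinder's existing DB migrations (the migration number and other details
are intentionally left out):

.. code-block:: python

   from sqlalchemy import Boolean, Column, MetaData, Table


   def upgrade(migrate_engine):
       meta = MetaData(bind=migrate_engine)

       for table_name in ('volumes', 'snapshots'):
           table = Table(table_name, meta, autoload=True)
           if not hasattr(table.c, 'use_quota'):
               # Existing rows are left NULL here and fixed up by the online
               # data migration; new rows will default to consuming quota.
               table.create_column(Column('use_quota', Boolean,
                                          nullable=True))
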
REST API impact
---------------

There won't be any new REST API endpoint, since ``use_quota`` is an internal
field and we don't want users or administrators modifying it.

But since this is useful information, we will add the field to the volume and
snapshot JSON responses of all the endpoints that return them, under a more
user oriented name, ``consumes_quota`` (a sketch of the mapping follows the
list below):

* Create volume
* Show volume
* Update volume
* List detailed volumes
* Create snapshot
* Show snapshot
* Update snapshot
* List detailed snapshots
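
Purely as an illustration of the field mapping, and not of the actual view
builder code, the detailed response could be built along these lines:

.. code-block:: python

   def _translate_volume_detail_view(volume):
       """Build the user facing dict for a volume (illustrative subset)."""
       return {
           'id': volume.id,
           'status': volume.status,
           'size': volume.size,
           # The internal ``use_quota`` field is exposed to users under the
           # friendlier name ``consumes_quota``.
           'consumes_quota': volume.use_quota,
       }
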
Security impact
---------------

None.

Active/Active HA impact
-----------------------

None, since this mostly just affects whether quota code is called or not when
receiving REST API delete requests.

Notifications impact
--------------------

None.

Other end user impact
---------------------

The change requires a patch to python-cinderclient to show the new returned
field, ``consumes_quota``.

Performance Impact
------------------

There should be no performance detriment from this change, since the field
would be set at creation time and would not require additional DB queries.

Moreover, performance improvements should be possible in the future once we
remove the compatibility code for the current temporary volume checks, for
example no longer writing to the admin metadata table, or making quota sync
calculations directly in the DB.
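
For instance, once the field is the single source of truth, volume usage could
in principle be computed with a single aggregate query.  This is only a sketch
of a possible future optimization, assuming SQLAlchemy 1.4 style Core, and is
not part of this change:

.. code-block:: python

   from sqlalchemy import false, func, select, true


   def _volume_usage(session, volumes, project_id):
       """Count volumes and gigabytes that actually consume quota."""
       query = (
           select(func.count(volumes.c.id),
                  func.coalesce(func.sum(volumes.c.size), 0))
           .where(volumes.c.project_id == project_id)
           .where(volumes.c.deleted == false())
           .where(volumes.c.use_quota == true())
       )
       return session.execute(query).one()
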
Other deployer impact
---------------------

None.

Developer impact
----------------

By default, Volume and Snapshot OVOs will consume quota on creation (their
``use_quota`` field is set to ``True``).  Developers who want to create
temporary resources, that is, resources that don't consume quota on creation
or release it on deletion, will need to pass ``use_quota=False`` at creation
time.

Also, code that updates quota usage (adding or removing it) will have to check
this field on Volumes and Snapshots.

It will no longer be necessary to add admin metadata entries or check the
``migration_status`` field, which should make coding easier and reduce the
number of related bugs.
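
A hedged sketch of the intended developer workflow.  The field and OVO names
match this proposal, but the surrounding code, the snapshot name and the quota
helper are made up for illustration:

.. code-block:: python

   from cinder import objects

   # Creating a temporary snapshot that never touches the quota.
   temp_snapshot = objects.Snapshot(
       context,
       volume_id=volume.id,
       display_name='temp snapshot of %s' % volume.id,
       use_quota=False)
   temp_snapshot.create()

   # Quota handling code checks the new field instead of admin metadata or
   # the ``migration_status`` field.
   if temp_snapshot.use_quota:
       _update_quota_usage(context, temp_snapshot)   # hypothetical helper
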
Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Work Items
----------

* DB schema changes.
* DB online migration and OVO changes.
* Update existing operations that mark volumes as temporary to use the new
  ``use_quota`` field.
* Update operations that are not currently marking resources as temporary to
  do so with the new ``use_quota`` field.
* REST API changes to return the ``use_quota`` field as ``consumes_quota``.
* Cinderclient changes.

Dependencies
============

None.

Testing
=======

No new tempest tests will be added, since the cases we want to fix are mostly
error situations that we cannot force in tempest.

Unit tests will be provided as with any other patch.

Documentation Impact
====================

The API reference documentation will be updated.

References
==========

Proposed Cinder code implementation:

* https://review.opendev.org/c/openstack/cinder/+/786385
* https://review.opendev.org/c/openstack/cinder/+/786386

Proposed python-cinderclient code implementation:

* https://review.opendev.org/c/openstack/python-cinderclient/+/787407

Proposed code to leverage this new functionality in the RBD driver to not
flatten temporary resources:

* https://review.opendev.org/c/openstack/cinder/+/790492