NVMe monitoring and healing agent for NVMe connector.
Depends-On: https://review.opendev.org/c/openstack/cinder-specs/+/766730
Implements: blueprint nvmeof-client-raid-healing-agent
Change-Id: I9b76fc4b1f13ddf07769136ec975148e1e109ca8
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
NVMe Connector Healing Agent
============================

https://blueprints.launchpad.net/cinder/+spec/nvmeof-client-raid-healing-agent

Daemon that monitors NVMe connections and MDRAID arrays created by the
NVMe connector, identifies faulted volume replicas, requests new replicas
and replaces faulted replicas with new ones.


Problem description
===================

When the NVMe connector connects a replicated volume, OpenStack sees it
as one volume, and has no way of monitoring, managing, and healing the
replicas in these MDRAID arrays. This agent will take care of that.

It will monitor the state of the MDRAID arrays and reconcile their physical
state on the host with the expected state from the volume provisioner,
replacing broken legs.

For backend volume replicas, it's the storage array that takes care of
monitoring and replacing unhealthy replicas.

NVMe MDRAID moves the data replication responsibility from the backend to
the consumer.

Currently there's no mechanism to monitor and heal these replicated volumes.

We cannot do it on the Cinder side, because even if the Cinder driver detected
the issue and created a replacement volume, we have no mechanism to report the
connection information of the replacement volume to the consumer.

So the monitoring and healing needs to be on the volume consumer side.

This agent will also be greatly beneficial in scenarios where certain replicas
of an attached replicated volume go faulty: by notifying the volume provisioner
of the faulty devices, it allows them to be marked as faulty, so that stale
data is not used on re-attachment and the replicas can be replaced entirely.


Use Cases
=========

When working with replicated NVMe volumes that are attached to an instance
for a long time, one of the replicas may go faulty.
This agent will detect that and attempt to replace the replica (self-healing
the MDRAID array, without the need to detach and re-attach the volume).


Proposed change
===============

Add an "NVMe agent" class that will be initialized by the NVMe connector
during volume connection on a host.

Initializing this agent will spawn a monitoring task which will repeat
periodically. We are proposing this to be a native thread if possible,
but if necessary it can be an independent process.
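
A minimal sketch of the native-thread variant, assuming a singleton agent per
host; class and method names are illustrative, not a final os-brick API:

.. code-block:: python

    import threading


    class NVMeAgent(object):
        """Illustrative skeleton of the periodic monitoring task."""

        _instance = None
        _lock = threading.Lock()

        @classmethod
        def ensure_running(cls, interval=30):
            # Called by the connector from connect_volume(); at most one
            # monitoring task is ever started per host.
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(interval)
            return cls._instance

        def __init__(self, interval):
            self._interval = interval
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def _run(self):
            # Event.wait() returns False on timeout, so this loops every
            # `interval` seconds until a stop is requested.
            while not self._stop.wait(self._interval):
                self._monitor_and_heal()

        def _monitor_and_heal(self):
            pass  # monitoring/reconciliation logic goes here

Note that a daemon thread dies with its hosting process, which is exactly the
concern about the compute service going down discussed in the next paragraph.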

The first proposal was to use the Python event scheduler `sched.scheduler`,
but other alternatives, such as spawning a separate process communicated
with via a socket, may be chosen instead.
One key problem that would need to be addressed by this selection is the
scenario where the compute service goes down while the VMs continue operating
(and their volumes remain attached); we don't want to lose this agent in
that case.
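
For reference, the `sched.scheduler` variant would simply reschedule itself on
each run; this is a sketch only:

.. code-block:: python

    import sched
    import time

    scheduler = sched.scheduler(time.time, time.sleep)


    def monitor(interval=30):
        # ... check NVMe devices and MDRAID arrays here ...
        # Re-enter the event so the check repeats periodically.
        scheduler.enter(interval, 1, monitor, argument=(interval,))


    monitor()
    scheduler.run()  # blocks; would run inside the agent's thread or process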

When initialized, the agent will read access information for the volume
provisioner from a pre-determined config file location, in a vendor-specific
format; the content of this file should be provided by the system operator.
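
Since the format is vendor-specific, the following is only a hypothetical
example (the path, keys, and use of JSON are all assumptions) of how an agent
could load the provisioner access information:

.. code-block:: python

    import json

    # Hypothetical path; each vendor defines its own file format and keys.
    CONF_PATH = "/etc/nvme-agent/provisioner.json"


    def load_provisioner_config(path=CONF_PATH):
        # Example content the operator might provide:
        # {"endpoint": "https://provisioner.example:8090",
        #  "user": "agent", "password": "secret", "verify_ssl": true}
        with open(path) as f:
            return json.load(f)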

The task will monitor NVMe devices and the MDRAID arrays built over them.

It will know which NVMe devices and MDRAID arrays to monitor based on metadata
from the volume provisioner (backend), which it will access through a custom
interface.

It will notify the volume provisioner of failed devices when necessary.

It will attempt to connect to new NVMe devices / replicas and swap them into
the MDRAID array in place of the faulted ones.
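
The replacement itself boils down to mdadm manage-mode operations; the sketch
below (device names hypothetical) shows the general shape, though in os-brick
these calls would go through the connector's privileged executor rather than
plain `subprocess`:

.. code-block:: python

    import subprocess


    def replace_leg(md_dev, failed_dev, new_dev):
        """Swap a faulted replica out of an MDRAID array for a new one."""
        # Mark the leg failed (a no-op if the kernel already did) and drop it.
        subprocess.run(["mdadm", md_dev, "--fail", failed_dev], check=False)
        subprocess.run(["mdadm", md_dev, "--remove", failed_dev], check=True)
        # Add the freshly connected NVMe device; md then resyncs onto it.
        subprocess.run(["mdadm", md_dev, "--add", new_dev], check=True)


    replace_leg("/dev/md0", "/dev/nvme1n1", "/dev/nvme2n1")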

Typical self-healing flow (sketched in code below):

1. volume replica goes faulty
2. agent notices faulty replica, reports to provisioner
3. provisioner marks replica as bad (so it won't be used later unless synced)
4. agent keeps pulling volume information from provisioner
5. certain grace period passes, agent sees no state changes of faulty replica
   from provisioner, so it sends explicit request to replace replica
6. provisioner replaces replica and updates volume information
7. agent pulls volume replica information, notices a replica has changed
8. agent carries out replica replacement
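
A sketch of one pass through this flow; the provisioner and array objects and
their methods are illustrative stand-ins for the vendor interface described
under Developer impact:

.. code-block:: python

    import time


    def heal_once(provisioner, array, grace_seconds=60, poll_seconds=5):
        """One pass of the self-healing flow above (all names illustrative)."""
        faulty = array.find_faulty_replicas()             # step 2
        for replica in faulty:
            provisioner.report_faulty(replica)            # steps 2-3
        deadline = time.time() + grace_seconds
        while faulty and time.time() < deadline:          # step 4
            info = provisioner.get_volume_info(array.volume_id)
            faulty = [r for r in faulty if not info.replaced(r)]
            time.sleep(poll_seconds)
        for replica in faulty:                            # step 5
            provisioner.request_replacement(replica)
        info = provisioner.get_volume_info(array.volume_id)  # step 7
        array.reconcile(info)                             # step 8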

Alternatives
------------

The operator could use a script of their own to monitor connections and fix
them manually.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

The agent will call NVMe connector methods that perform sudo executions of
`nvme` and `mdadm`. This will happen in the new agent task spawned from
os-brick.

Active/Active HA impact
-----------------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

None

Other deployer impact
---------------------

None

Developer impact
----------------

To allow multiple vendor implementations, the specific methods / logic for:

- probing the volume provisioner
- pulling / parsing volume metadata from the provisioner
- reporting volume state changes to the provisioner
- requesting the provisioner to replace a replica

will need to be implemented on a per-vendor basis.

The architecture is such that the agent will be a generic class providing
the interface, and the Kioxia implementation will be the first example of a
vendor-specific implementation.
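
A possible shape for that generic interface (method names are illustrative):

.. code-block:: python

    import abc


    class ProvisionerClient(abc.ABC):
        """Per-vendor interface the generic agent programs against.

        The Kioxia client would be the first concrete implementation.
        """

        @abc.abstractmethod
        def probe(self):
            """Check that the volume provisioner is reachable."""

        @abc.abstractmethod
        def get_volume_info(self, volume_id):
            """Pull and parse replica metadata for one volume."""

        @abc.abstractmethod
        def report_faulty(self, replica):
            """Report a faulty device / state change to the provisioner."""

        @abc.abstractmethod
        def request_replacement(self, replica):
            """Ask the provisioner to replace a faulted replica."""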


Implementation
==============

Assignee(s)
-----------

Zohar Mamedov (zoharm)

Work Items
----------

The NVMe connector will launch the monitoring task on connect_volume if it is
not already running.

The task monitors NVMe devices and MDRAID arrays created by the connector.

When a replica goes faulty (as well as on other events, such as disconnects),
call the interface method for notifying the volume provisioner.

When replicated volume devices are changed by the volume provisioner,
reconcile the physical state of NVMe devices and MDRAID arrays on the host.


Dependencies
============

None


Testing
=======

We should be able to accept this with just unit tests.


Documentation Impact
====================

Document that using the NVMe connector with replicated volumes will optionally
launch this agent.


References
==========

Architectural diagram:
https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png