NVMe Connector Healing Agent
https://blueprints.launchpad.net/cinder/+spec/nvmeof-client-raid-healing-agent
Daemon that monitors NVMe connections and MDRAID arrays created by the NVMe connector, identifies faulted volume replicas, requests new replicas, and replaces the faulted ones with them.
Problem description
When the NVMe connector connects a replicated volume, OpenStack will see it as one volume and has no way of monitoring, managing, and healing the replicas in these MDRAID arrays. This agent will take care of that.
It will monitor the state of the MDRAID arrays and reconcile their physical state on the host with the expected state from the volume provisioner, replacing broken legs.
For backend volume replicas, it's the storage array that takes care of monitoring and replacing unhealthy replicas.
NVMe MDRAID moves the data replication responsibility from the backend to the consumer.
Currently there's no mechanism to monitor and heal these replicated volumes.
We cannot do it on the Cinder side: even if the Cinder driver detected the issue and created a replacement volume, there is no mechanism to report the connection information of the replacement volume to the consumer.
So the monitoring and healing needs to be on the volume consumer side.
This agent will also be greatly beneficial for scenarios where certain replicas of an attached replicated volume go faulty: by notifying the volume provisioner of the faulty devices, they can be marked as faulty to avoid using stale data on re-attachment, and can then be replaced entirely.
Use Cases
When working with replicated NVMe volumes that are attached to an instance for a long time, one of the replicas may go faulty. This agent will detect that and attempt to replace it (self-heal the MDRAID array without the need to detach and re-attach the volume).
Proposed change
Add an "NVMe agent" class that will be initialized by the NVMe connector during volume connection on a host.
Initializing this agent will spawn a monitoring task which will repeat periodically. We are proposing this to be a native thread if possible, but if necessary it can be an independent process.
The first proposal was to use the Python event scheduler (sched.scheduler), but other alternatives, such as spawning a separate process communicated with via a socket, may be chosen instead. One key problem that this choice must address is the scenario where the compute service goes down while the VMs continue operating (and their volumes remain attached); we do not want to lose the agent in that case.
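To illustrate the native thread option, the following is a minimal sketch of how the agent could spawn its periodic monitoring task; the names (NVMeAgent, MONITOR_INTERVAL, _monitor_once) are placeholders and not a final os-brick API::

    import threading

    MONITOR_INTERVAL = 60  # seconds between monitoring passes (illustrative)

    class NVMeAgent(object):

        def __init__(self):
            self._stop = threading.Event()
            self._thread = None

        def start(self):
            # Spawn the periodic monitoring task if it is not already running.
            if self._thread and self._thread.is_alive():
                return
            # A daemon thread dies with the process; a separate process could
            # be chosen instead if the agent must survive a service restart.
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def _run(self):
            while not self._stop.wait(MONITOR_INTERVAL):
                self._monitor_once()

        def _monitor_once(self):
            # Check NVMe devices and MDRAID arrays and reconcile them with the
            # state reported by the volume provisioner.
            pass
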
When initialized, the agent will read access information for the volume provisioner from a pre-determined config file location, in a vendor-specific format; its content should be placed there by the system operator.
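For example, the agent could load this information at start-up; the path and keys below are hypothetical, since the actual location and format are vendor specific::

    import json

    AGENT_CONF = '/etc/os-brick/nvme_agent.conf'  # hypothetical location

    def load_provisioner_config(path=AGENT_CONF):
        # Read provisioner access information placed there by the operator,
        # e.g. {"endpoint": "https://...", "token": "..."} (illustrative only).
        with open(path) as f:
            return json.load(f)
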
The task will monitor NVMe devices and MDRAID arrays built over them.
It will know which NVMe devices and MDRAID arrays to monitor based on metadata from the volume provisioner (backend), to which it will have a custom interface.
It will notify the volume provisioner of failed devices when necessary.
It will attempt to connect to new NVMe devices / replicas and use them to replace the faulted ones in the MDRAID arrays.
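As an example of the monitoring step, a faulted MDRAID leg can be detected by parsing /proc/mdstat, where failed members are marked with "(F)"; this is only a sketch of one possible check::

    import re

    def find_faulty_members(mdstat_path='/proc/mdstat'):
        # Return {md name: [faulty member devices]} for arrays with failed
        # legs, e.g. a line such as
        # "md0 : active raid1 nvme1n1[1] nvme0n1[0](F)".
        faulty = {}
        with open(mdstat_path) as f:
            for line in f:
                match = re.match(r'^(md\d+) : ', line)
                if not match:
                    continue
                devices = re.findall(r'(\S+)\[\d+\]\(F\)', line)
                if devices:
                    faulty[match.group(1)] = devices
        return faulty
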
Typical self-healing flow (a sketch of one monitoring pass follows the list):
- volume replica goes faulty
- agent notices faulty replica, reports to provisioner
- provisioner marks replica as bad (so it won't be used later unless synced)
- agent keeps pulling volume information from provisioner
- a certain grace period passes and the agent sees no state change of the faulty replica from the provisioner, so it sends an explicit request to replace the replica
- provisioner replaces replica and updates volume information
- agent pulls volume replica information, notices a replica has changed
- agent carries out replica replacement
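The sketch below shows one pass of that loop, reusing the find_faulty_members() helper from above; the provisioner methods and the replace_replica_in_mdraid() helper are illustrative stand-ins for the vendor-specific interface described under "Developer impact"::

    import time

    GRACE_PERIOD = 300  # seconds to let the provisioner act on its own

    def heal_pass(provisioner, reported):
        # Report newly faulted replicas to the provisioner.
        for md_name, devices in find_faulty_members().items():
            for dev in devices:
                if dev not in reported:
                    provisioner.report_faulty_replica(md_name, dev)
                    reported[dev] = time.time()

        # Keep pulling volume information and react to changes.
        for dev, since in list(reported.items()):
            volume = provisioner.get_volume_info(dev)
            if volume.replica_replaced(dev):
                # A replica changed: connect it and rebuild the MDRAID array.
                replace_replica_in_mdraid(volume, dev)
                del reported[dev]
            elif time.time() - since > GRACE_PERIOD:
                # No state change after the grace period: ask explicitly.
                provisioner.request_replica_replacement(dev)
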
Alternatives
The operator could use their own script to monitor connections and fix them manually.
Data model impact
None
REST API impact
None
Security impact
The agent will call NVMe connector methods that perform sudo executions of nvme and mdadm. This will happen in the new agent task that will be spawned from os-brick.
Active/Active HA impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
None
Other deployer impact
None
Developer impact
To allow multiple vendor implementations, the specific methods / logic for:
- probing the volume provisioner
- pulling / parsing volume metadata from provisioner
- reporting volume state changes to provisioner
- requesting provisioner to replace replica
will need to be implemented on a per-vendor basis.
The architecture is such that the agent will be a generic class providing this interface, and the Kioxia implementation will be the first example of a vendor-specific implementation.
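A minimal sketch of that generic interface could look as follows; the method names are illustrative only, and the actual API will be settled during implementation::

    import abc

    class VolumeProvisionerInterface(metaclass=abc.ABCMeta):
        # Per-vendor implementations (Kioxia being the first) subclass this.

        @abc.abstractmethod
        def probe(self):
            """Probe the volume provisioner for availability."""

        @abc.abstractmethod
        def get_volume_metadata(self, volume_id):
            """Pull and parse replicated volume metadata from the provisioner."""

        @abc.abstractmethod
        def report_volume_state(self, volume_id, replica, state):
            """Report a replica state change (e.g. faulty) to the provisioner."""

        @abc.abstractmethod
        def request_replica_replacement(self, volume_id, replica):
            """Request the provisioner to replace a faulted replica."""
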
Implementation
Assignee(s)
- Zohar Mamedov (zoharm)
Work Items
The NVMe connector will launch the monitoring task on connect_volume if it is not already running.
Task monitors NVMe devices and MDRAID arrays created by the connector.
When a replica goes faulty (as well as on other events, such as disconnects), call the interface method for notifying the volume provisioner.
When replicated volume devices are changed by the volume provisioner, reconcile the physical state of NVMe devices and MDRAID arrays on the host.
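As an illustration of that reconcile step, replacing a faulted leg boils down to the commands below (expanding the replace_replica_in_mdraid() helper used in the earlier sketch). In os-brick these would go through the connector's privileged executor rather than subprocess, and the volume attributes shown are hypothetical::

    import subprocess

    def replace_replica_in_mdraid(volume, old_dev):
        # Replace a faulted leg of the volume's MDRAID array with a new replica.
        md_dev = volume.md_device                    # e.g. '/dev/md0'
        target = volume.replacement_target(old_dev)  # new connection info
        # Mark (if not already marked) and drop the faulted leg.
        subprocess.check_call(['mdadm', '--manage', md_dev, '--fail', old_dev])
        subprocess.check_call(['mdadm', '--manage', md_dev, '--remove', old_dev])
        # Connect the replacement NVMe-oF namespace.
        subprocess.check_call(['nvme', 'connect', '-t', target['transport'],
                               '-a', target['addr'], '-s', target['svcid'],
                               '-n', target['nqn']])
        # Add the new device to the array; MDRAID will resync it.
        subprocess.check_call(['mdadm', '--manage', md_dev, '--add',
                               target['device']])
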
Dependencies
None
Testing
We should be able to accept this with just unit tests.
Documentation Impact
Document that using the NVMe connector with replicated volumes will optionally launch this agent.
References
Architectural diagram https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png