..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================
NVMeoF Connection Agent
=======================

https://blueprints.launchpad.net/cinder/+spec/nvmeof-connection-agent

Daemon that monitors NVMeoF connections and MDRAID arrays created by the
new NVMeoF connector. It reports initiator-side events to the storage
orchestrator, identifies faulted volume replicas, requests new replicas and
replaces faulted replicas with newly assigned ones.


Problem description
===================

When the NVMe connector connects a client-replicated volume, OpenStack sees it
as a single volume and has no way of monitoring, managing, and healing the
replicas in the underlying MDRAID array. This agent will take care of that.

Currently there is no mechanism to monitor and heal these replicated volumes.
This cannot be done on the Cinder driver side alone, because there is no
integrated mechanism to detect initiator connection events and carry out
replica replacement on the compute node.

For target-side volume replication (the traditional approach), it is the
storage backend that takes care of monitoring and self healing.
The NVMe + MDRAID approach moves the data replication responsibility from the
storage backend to the consuming initiator (i.e., the compute node).

So the monitoring and healing need to happen on the initiator / compute side.

With this approach, the agent will monitor the NVMeoF connections and report
changes to the storage orchestrator / provisioner. It will monitor MDRAID arrays
and reconcile their physical state on the host with the expected state from the
volume provisioner, replacing broken legs.

Finally, orchestration decisions / optimizations will be carried out by the
volume orchestrator / provisioner using the information reported by the agent.
Though this is outside the scope of the agent (it is functionality implemented
by the storage backend), it is useful to mention here that it will handle
cases such as avoiding faulty replicas during re-attachment scenarios,
because in this design only the initiator node can detect the sync states of
the replicas in its MDRAID arrays.


Use Cases
=========

When working with replicated NVMeoF volumes that are attached to an instance
for a long time, one of the replicas may become faulty.
This agent will detect it and attempt to replace it, i.e., self heal the
MDRAID array, without the need to detach and re-attach the entire volume from
the instance.

Additionally, the agent will detect and report connection and replica sync
state events to the storage orchestrator (or potentially other endpoints
that can make use of them). The storage backend gathers this information for
making storage provisioning / orchestration decisions, as well as for
telemetry.


Proposed change
===============

Add agent entry point code to os-brick, such as:
`os_brick/os-brick/cmd/agent.py`

Add an `entry_points` `console_scripts` entry in os-brick's `setup.cfg`.

The agent main function will first initialize the agent by reading access
information for the volume orchestrator / provisioner from a pre-defined config
file (such as `/etc/nvme-agent/agent.conf` ?)

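A minimal sketch of what such an entry point could look like, assuming a
`console_scripts` entry of the usual `name = module:function` form points at a
`main` function. The program name `nvmeof-agent`, the `--config-file` option
and the `parse_args` helper are hypothetical and only illustrate the shape;
the default path is the tentative one mentioned above.

```python
import argparse
import sys

# Tentative path taken from the text above; the final location is undecided.
DEFAULT_CONFIG = "/etc/nvme-agent/agent.conf"


def parse_args(argv):
    """Parse command-line options for the (hypothetical) agent command."""
    parser = argparse.ArgumentParser(
        prog="nvmeof-agent",
        description="NVMeoF connection agent (illustrative sketch)",
    )
    parser.add_argument(
        "--config-file",
        default=DEFAULT_CONFIG,
        help="path to the agent configuration file",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point that a console_scripts entry could reference."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    # A real agent would load the config and start its periodic task here.
    print("would start agent with config", args.config_file)
    return 0


exit_code = main(["--config-file", "/tmp/agent.conf"])
```

Returning an exit code from `main` keeps the function easy to call from unit
tests as well as from the generated console script.
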
Vendor-specific parameters will be used, each prefixed with the vendor name,
such as:

- `kioxia_provisioner_ip`
- `kioxia_provisioner_port`
- `kioxia_provisioner_token`
- `kioxia_cert_file`

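As an illustration of how the vendor-prefixed options above might be read,
here is a sketch that assumes an INI-style file with the options in a
`[DEFAULT]` section. The file format, the `ProvisionerConfig` class, the
`load_provisioner_config` helper and all example values are assumptions, and
the standard library `configparser` is used only as a stand-in for whatever
configuration library the agent ends up using.

```python
import configparser
from dataclasses import dataclass


@dataclass
class ProvisionerConfig:
    """Access information for the provisioner (hypothetical container)."""
    ip: str
    port: int
    token: str
    cert_file: str


def load_provisioner_config(text, vendor="kioxia", section="DEFAULT"):
    """Parse config text and collect the options carrying the vendor prefix."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    options = parser[section]
    prefix = vendor + "_"
    return ProvisionerConfig(
        ip=options[prefix + "provisioner_ip"],
        port=options.getint(prefix + "provisioner_port"),
        token=options[prefix + "provisioner_token"],
        cert_file=options[prefix + "cert_file"],
    )


# Placeholder values for illustration only.
EXAMPLE_CONF = """
[DEFAULT]
kioxia_provisioner_ip = 192.0.2.10
kioxia_provisioner_port = 8090
kioxia_provisioner_token = example-token
kioxia_cert_file = /etc/nvme-agent/provisioner.pem
"""

config = load_provisioner_config(EXAMPLE_CONF)
```

Passing the vendor name as a parameter means the same loader could serve
another vendor prefix without changes.
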
Once initialized, the agent will start a periodic task that will do
the following:

- Host probe / heartbeat to the storage orchestrator / provisioner
- Pull volume metadata (connection and replication state) from provisioner
- Monitor NVMeoF devices and MDRAID arrays belonging to it
- Detect connection and replication state changes and report to provisioner
- Request replacements for faulty replicas
- Reconcile replica states from provisioner (carry out the replacements)

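The periodic task listed above could be structured roughly as follows. This
is only a sketch: the `provisioner` and `host` objects, their method names and
the state strings are hypothetical, the in-memory fakes exist purely so the
example runs, and the grace period before requesting a replacement is left out
for brevity. The 30 second default matches the wake-up interval given under
Performance Impact.

```python
import time


def run_cycle(provisioner, host):
    """Run one pass of the periodic task, in the order listed above."""
    provisioner.heartbeat()
    expected = provisioner.pull_volumes()   # replica -> state per provisioner
    observed = host.scan()                  # replica -> state seen on the host
    changed = {r: s for r, s in observed.items() if expected.get(r) != s}
    if changed:
        provisioner.report(changed)
    for replica, state in changed.items():
        if state == "faulty":
            provisioner.request_replacement(replica)
    host.reconcile(provisioner.pull_volumes())
    return changed


def run_forever(provisioner, host, interval=30.0, sleep=time.sleep, cycles=None):
    """Repeat run_cycle, sleeping between passes; cycles=None loops forever."""
    done = 0
    while cycles is None or done < cycles:
        run_cycle(provisioner, host)
        done += 1
        sleep(interval)


class FakeProvisioner:
    """In-memory stand-in for a vendor provisioner client."""

    def __init__(self):
        self.volumes = {"replica-a": "in_sync", "replica-b": "in_sync"}
        self.calls = []

    def heartbeat(self):
        self.calls.append("heartbeat")

    def pull_volumes(self):
        self.calls.append("pull")
        return dict(self.volumes)

    def report(self, changed):
        self.calls.append("report")
        self.volumes.update(changed)

    def request_replacement(self, replica):
        self.calls.append("replace:" + replica)


class FakeHost:
    """In-memory stand-in for the NVMeoF / MDRAID state on the compute node."""

    def __init__(self, observed):
        self.observed = observed
        self.reconciled = None

    def scan(self):
        return dict(self.observed)

    def reconcile(self, expected):
        self.reconciled = expected


provisioner = FakeProvisioner()
host = FakeHost({"replica-a": "in_sync", "replica-b": "faulty"})
changed = run_cycle(provisioner, host)
```

Injecting `sleep` and `cycles` keeps the loop testable without waiting for
real time to pass.
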
Typical self healing flow:

1. volume replica goes faulty
2. agent notices faulty replica, reports to provisioner
3. provisioner marks replica as bad (so it won't be used later unless synced)
4. agent keeps pulling volume information from provisioner
5. certain grace period passes, agent sees no state changes of faulty replica
   from provisioner, so it sends explicit request to replace replica
6. provisioner replaces replica and updates volume information
7. agent pulls volume replica information, notices a replica has changed
8. agent carries out replica replacement

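The numbered flow above can be sketched as a small tracker that is stepped
once per periodic pass. Everything named here (`ReplicaHealer`, the
provisioner methods, the `reconcile` callback, the 60 second grace period) is
an assumption made for illustration; the spec only says that a "certain grace
period" passes. A replica that recovers on its own is not handled in this
sketch.

```python
class ReplicaHealer:
    """Tracks faulty replicas of a volume through the self healing flow."""

    def __init__(self, provisioner, reconcile, grace_period=60.0):
        self.provisioner = provisioner
        self.reconcile = reconcile        # callback that swaps array legs
        self.grace_period = grace_period  # seconds; value is an assumption
        self.faulty_since = {}            # replica id -> time first seen faulty
        self.requested = set()            # replicas with a replacement request

    def step(self, volume_id, attached, faulty, now):
        # Step 2: report newly detected faulty replicas.
        for replica in faulty - set(self.faulty_since):
            self.provisioner.report_faulty(volume_id, replica)
            self.faulty_since[replica] = now
        # Steps 4 and 7: pull the current replica assignment.
        assigned = set(self.provisioner.get_replicas(volume_id))
        stale = set(self.faulty_since) - assigned
        if stale:
            # Step 8: the provisioner replaced replicas; swap the legs.
            self.reconcile(volume_id, remove=stale, add=assigned - attached)
            for replica in stale:
                del self.faulty_since[replica]
                self.requested.discard(replica)
            return
        # Step 5: after the grace period, explicitly request a replacement.
        for replica, since in self.faulty_since.items():
            if replica not in self.requested and now - since >= self.grace_period:
                self.provisioner.request_replacement(volume_id, replica)
                self.requested.add(replica)


class FakeProvisioner:
    """In-memory stand-in that replaces a replica as soon as it is asked to."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.log = []

    def report_faulty(self, volume_id, replica):
        self.log.append(("report", replica))   # step 3 would happen here

    def get_replicas(self, volume_id):
        return list(self.replicas)

    def request_replacement(self, volume_id, replica):
        self.log.append(("request", replica))
        # Step 6: drop the bad replica and assign a new one.
        kept = [r for r in self.replicas if r != replica]
        self.replicas = kept + [replica + "-new"]


swaps = []
prov = FakeProvisioner(["r1", "r2"])
healer = ReplicaHealer(
    prov, lambda vol, remove, add: swaps.append((remove, add)), grace_period=60.0
)
for now in (0.0, 30.0, 60.0, 90.0):
    healer.step("vol-1", attached={"r1", "r2"}, faulty={"r2"}, now=now)
```

Passing `now` in explicitly, rather than reading a clock inside `step`, makes
the grace period behaviour easy to exercise in unit tests.
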
Alternatives
------------

Operators could use their own scripts to monitor connections and the states of
replicated arrays, report detected events, and carry out replica replacements
manually.

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

The agent requires sudo execution of `nvme` and `mdadm`.
It also needs read access to root filesystem paths such as:

- `/sys/class/nvme-fabrics/...`
- `/sys/class/block/...`

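As an illustration of the kind of sysfs read involved, the sketch below checks
an MD array through the kernel's `md/degraded` and `md/dev-*/state`
attributes, as described in the Linux md documentation. Whether the agent
reads these attributes or relies on `mdadm` output is not specified here, and
the helper names and the fake directory tree are assumptions made so the
example runs without root access or real arrays.

```python
import tempfile
from pathlib import Path


def degraded_count(md_name, sysfs_root="/sys/class/block"):
    """Return the number of missing legs the kernel reports for an MD array."""
    path = Path(sysfs_root) / md_name / "md" / "degraded"
    return int(path.read_text().strip())


def faulty_members(md_name, sysfs_root="/sys/class/block"):
    """Return member device names whose state flags include 'faulty'."""
    md_dir = Path(sysfs_root) / md_name / "md"
    faulty = []
    for dev_dir in sorted(md_dir.glob("dev-*")):
        # The state file holds a comma separated list of flags.
        flags = (dev_dir / "state").read_text().strip().split(",")
        if "faulty" in flags:
            faulty.append(dev_dir.name[len("dev-"):])
    return faulty


# Demo against a fake sysfs tree built in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    md_dir = Path(root) / "md0" / "md"
    for name, state in (("nvme0n1", "in_sync"), ("nvme1n1", "faulty")):
        (md_dir / ("dev-" + name)).mkdir(parents=True)
        (md_dir / ("dev-" + name) / "state").write_text(state + "\n")
    (md_dir / "degraded").write_text("1\n")
    demo_degraded = degraded_count("md0", sysfs_root=root)
    demo_faulty = faulty_members("md0", sysfs_root=root)
```

Taking the sysfs root as a parameter lets unit tests point the helpers at a
temporary directory instead of the live system.
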
Active/Active HA impact
-----------------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

If the operator configures it to run, the agent will be a new process running
on the compute node. Though it will spend most of its time sleeping, it will
wake up every 30 seconds to do its periodic tasks: probe the storage provisioner
and inspect NVMe connections and MDRAID states.

These tasks are not compute intensive; most of the time is spent waiting for a
response from the storage provisioner, and the NVMe and MDRAID operations have
time complexity linear in the number of devices under the control of
the agent (which can be treated as constant due to a low upper limit per host).
Finally, the effect on the network will also be small, since the agent
only sends and receives small amounts of (meta)data.

Other deployer impact
---------------------

None

Developer impact
----------------

To allow multiple vendor implementations, the specific methods / logic for the
following will need to be implemented on a per-vendor basis:

- probing / heartbeating the storage provisioner
- pulling / parsing volume metadata from provisioner
- reporting state changes to provisioner
- requesting provisioner to replace replica

These all involve communication with, and functionality carried out by, the
storage backend provisioner / orchestrator.

The architecture is such that the agent will be a generic daemon that
defines the interface, and the Kioxia implementation will be the first
example of a vendor-specific implementation.

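One way the generic daemon could define that interface is an abstract base
class with one method per vendor-specific operation listed above. This is a
sketch under assumptions: the class name `ProvisionerClient`, the method names
and their signatures are hypothetical, and `InMemoryProvisioner` is a toy
stand-in rather than a real vendor implementation.

```python
import abc


class ProvisionerClient(abc.ABC):
    """Vendor-facing interface; one subclass per storage backend."""

    @abc.abstractmethod
    def heartbeat(self, host_id):
        """Probe the provisioner and announce that this host is alive."""

    @abc.abstractmethod
    def get_volumes(self, host_id):
        """Return volume metadata (connection and replication state)."""

    @abc.abstractmethod
    def report_state(self, volume_id, replica_id, state):
        """Report a connection or replication state change."""

    @abc.abstractmethod
    def request_replacement(self, volume_id, replica_id):
        """Ask the provisioner to replace a faulty replica."""


class InMemoryProvisioner(ProvisionerClient):
    """Toy stand-in for a vendor implementation, for illustration only."""

    def __init__(self):
        self.hosts = set()
        self.states = {}
        self.replacement_requests = []

    def heartbeat(self, host_id):
        self.hosts.add(host_id)
        return True

    def get_volumes(self, host_id):
        return dict(self.states)

    def report_state(self, volume_id, replica_id, state):
        self.states[(volume_id, replica_id)] = state

    def request_replacement(self, volume_id, replica_id):
        self.replacement_requests.append((volume_id, replica_id))


client = InMemoryProvisioner()
client.heartbeat("compute-1")
client.report_state("vol-1", "replica-b", "faulty")
client.request_replacement("vol-1", "replica-b")
```

With an abstract base class, a vendor subclass that forgets one of the four
operations fails at instantiation time rather than in the middle of a
periodic pass.
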
Implementation
==============

Assignee(s)
-----------

Zohar Mamedov
zoharm

Work Items
----------

Agent entry point and initialization.

Agent periodic tasks:

- Host probe / heartbeat
- Monitoring (connection and replication event detection)
- Report events to storage provisioner
- Connection and replication state reconciliation


Dependencies
============

None


Testing
=======

We should be able to accept this with just unit tests.


Documentation Impact
====================

Document that, with this feature, os-brick will ship a console script
that is used to launch this agent.

Document how to configure the agent for usage.


References
==========

Presentation slides with architectural diagram on slide 2:
https://docs.google.com/presentation/d/1lPU8mQ7jJmr9Tybu5gXkbE7NC1ppkMnoBS4cgSFhzWc