NVMe connector support MD replication spec.

Implements: blueprint nvme-of-add-client-raid-1
Change-Id: Ibeebc62ec649933747b30537d3a6d4d84641fad0
specs/wallaby/nvme-connector-md-support.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
NVMe Connector Support MDRAID Replication
==========================================

https://blueprints.launchpad.net/cinder/+spec/nvme-of-add-client-raid-1

NVMe connector replicated volume support via MDRAID: allow OpenStack to use
replicated NVMe volumes that are distributed on scale-out storage.

Problem description
===================

When consuming block storage, resilience of the block volumes is required.
This can be achieved on the storage backend with NVMe and iSCSI, though the
target then remains a single point of failure. Multipathing provides HA for
the paths to a target, but it does not handle replication of the volume data
itself.

A storage solution exposing high performance NVMeoF storage needs a way to
handle volume replication. We propose achieving this by using RAID 1 on the
host, since the relative performance impact will be less significant in such
an environment and will be greatly outweighed by the resilience benefits.

Use Cases
=========

When resiliency is needed for NVMeoF storage.

Taking advantage of NVMe over a high performance fabric, consumers gain the
benefits of replication on the host while maintaining good performance.

In this case volume replica failures will have no impact on the consumer,
and self healing can take place seamlessly.

For developers of volume drivers, this will allow OpenStack to use
replicated volumes that they expose in their storage backend.

End users operating in this mode will benefit from the performance of NVMe
with added resilience due to volume replication.

Proposed change
===============

Expand the NVMeoF connector in os-brick to accept connection information for
replicated volumes.

When `connect_volume` is called with replicated volume information, connect
to all replica targets with NVMe and create a RAID 1 array over the
resulting devices.

.. image:: https://wiki.openstack.org/w/images/a/ab/Nvme-of-add-client-raid1-detail.png
   :width: 448
   :alt: NVMe client RAID diagram

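The connect flow above can be sketched as the sequence of CLI invocations the
connector would issue. This is a minimal illustration, not os-brick code: the
helper names, the device paths, and the exact `mdadm` options are assumptions
based on this spec.

```python
# Hypothetical helpers sketching the connect_volume flow for replicated
# volumes: one `nvme connect` per portal, then RAID 1 over the devices.

def nvme_connect_cmds(replica):
    """Build one `nvme connect` command per portal of a replica."""
    return [
        ['nvme', 'connect',
         '-t', portal['transport'],
         '-a', portal['address'],
         '-s', portal['port'],
         '-n', replica['nqn']]
        for portal in replica['portals']
    ]


def mdadm_create_cmd(md_name, devices):
    """Build the `mdadm` command assembling a RAID 1 array over devices."""
    return (['mdadm', '--create', md_name, '--run',
             '--level=1', '--raid-devices=%d' % len(devices)]
            + devices)


replicas = [
    {'nqn': 'nqn.2014-08.org.nvmexpress:uuid:aaa',
     'portals': [{'address': '10.0.0.101', 'port': '4420',
                  'transport': 'tcp'}]},
    {'nqn': 'nqn.2014-08.org.nvmexpress:uuid:bbb',
     'portals': [{'address': '10.0.0.110', 'port': '4420',
                  'transport': 'tcp'}]},
]

cmds = [c for r in replicas for c in nvme_connect_cmds(r)]
# After connecting, the connector would resolve each subsystem to its
# /dev/nvmeXnY block device; the names below are placeholders.
raid_cmd = mdadm_create_cmd('/dev/md/md0', ['/dev/nvme0n1', '/dev/nvme1n1'])
```
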
Alternatives
------------

Replication can also be actively maintained on the backend side, which is
how it is commonly done. However, with NVMe on a high performance fabric,
the benefits of handling replication on the consumer host can outweigh the
performance costs.

Adding support for this also increases the range of backends we fully
support, as not all vendors will choose to implement replication on the
backend side. The two approaches can coexist without impacting each other.

Data model impact
-----------------

The `connection_properties` parameter of the NVMe connector's
`connect_volume` and `disconnect_volume` can now also hold a
`volume_replicas` list, containing the information needed to connect to and
identify the NVMe subsystems over which the MDRAID replication is done.

Each replica dict in the `volume_replicas` list must include the normal flat
connection properties.

If `volume_replicas` has only one replica, it is treated as a non-replicated
volume.

If `volume_replicas` is omitted, the normal flat connection properties are
used, as in the existing version of the NVMe connector.

Non-replicated volume connection properties::

    {'uuid': '96e25fb4-9f91-4c88-ab59-275fd354777e',
     'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...',
     'portals': [{
         'address': '10.0.0.101',
         'port': '4420',
         'transport': 'tcp'
     }],
     'volume_replicas': None}

Replicated volume connection properties::

    {'volume_replicas': [
        {'uuid': '96e25fb4-9f91-4c88-ab59-275fd354777e',
         'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...',
         'portals': [{
             'address': '10.0.0.101',
             'port': '4420',
             'transport': 'tcp'
         }]},
        {'uuid': '12345fb4-9f91-4c88-ab59-275fd3544321',
         'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...',
         'portals': [{
             'address': '10.0.0.110',
             'port': '4420',
             'transport': 'tcp'
         }]},
    ]}

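The fallback rules above (single replica means non-replicated; omitted
`volume_replicas` means use the flat properties) can be sketched as follows.
`get_replicas` is a hypothetical helper for illustration, not an existing
os-brick function.

```python
def get_replicas(connection_properties):
    """Normalize connection properties per the rules above.

    Returns a list of flat per-replica property dicts; a single-element
    list means the volume is treated as non-replicated.
    """
    replicas = connection_properties.get('volume_replicas')
    if not replicas:
        # `volume_replicas` omitted (or None): fall back to the normal
        # flat connection properties, as the existing connector does.
        return [connection_properties]
    return replicas


flat = {'uuid': 'u1', 'nqn': 'nqn.x', 'portals': [], 'volume_replicas': None}
replicated = {'volume_replicas': [{'uuid': 'u1'}, {'uuid': 'u2'}]}
```
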
REST API impact
---------------

None

Security impact
---------------

Requires elevated privileges for managing MDRAID. The current NVMe connector
already executes the `nvme` CLI via sudo, so this change only adds execution
of `mdadm`.

Active/Active HA impact
-----------------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

Working in this replicated mode allows for special case scenarios. For
example, an MDRAID array with 4 replicas loses connection to two of them and
keeps writing data to the two remaining ones. Then, after a re-attach from a
reboot or a migration, the host for some reason has access only to the two
originally lost replicas, and not to the two "good" ones; the re-created
MDRAID array will then have old / bad data.

The above can be remedied by making the storage backend aware of devices
going faulty in the array. This is enabled by the NVMe monitoring agent,
which can recognize replicas going faulty in an array and notify the storage
backend, which will mark these replicas as faulty for the replicated volume.

Multi attach is not supported for NVMe MDRAID volumes.

Performance Impact
------------------

Replicated volume attachments will be slower, since an MDRAID array needs to
be built. This is a fair tradeoff: slower attachment for more resiliency.

Other deployer impact
---------------------

NVMe and MDRAID and their CLI clients (`nvme` and `mdadm`) need to be
available on the hosts for NVMe connections and RAID replication
respectively.

Developer impact
----------------

Gives storage vendors the option to support replicated NVMeoF volumes via
their driver.

To use this feature, volume drivers will need to expose NVMe storage that is
replicated and provide the necessary connection information for it.

This does not affect non-replicated volumes.

Implementation
==============

Assignee(s)
-----------

Zohar Mamedov (zoharm)

Work Items
----------

All done in the NVMe connector:

- In `connect_volume`, parse connection information for replicated volumes.
- Connect to the NVMeoF targets and identify the devices.
- Create an MD RAID 1 array over the devices.
- Return a symlink to the MDRAID device.
- In `disconnect_volume`, destroy the MDRAID array.
- In `extend_volume`, grow the MDRAID array.

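The teardown and grow steps from the work items above can be sketched as the
CLI invocations the connector would run. This is an assumption-laden
illustration: the helper names and device paths are placeholders, and the
spec does not pin down the exact `mdadm` options.

```python
# Hypothetical helpers for the disconnect_volume and extend_volume work
# items: stop the array, drop the NVMe connections, or grow the array.

def disconnect_cmds(md_device, subsystem_nqns):
    """Stop the MDRAID array first, then disconnect each NVMe subsystem."""
    cmds = [['mdadm', '--stop', md_device]]
    cmds += [['nvme', 'disconnect', '-n', nqn] for nqn in subsystem_nqns]
    return cmds


def extend_cmd(md_device):
    """Grow the array to use all available space on its member devices."""
    return ['mdadm', '--grow', md_device, '--size', 'max']
```
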
Dependencies
============

NVMe and MDRAID and their CLI clients (`nvme` and `mdadm`) need to be
available on the hosts for NVMe connections and RAID replication
respectively. The connector should fail gracefully if they are not found.

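One way to fail gracefully when the CLI dependencies are missing is a PATH
check before attempting the attach. This is an illustrative sketch, not code
from os-brick.

```python
import shutil


def missing_dependencies(required=('nvme', 'mdadm')):
    """Return the required executables that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


missing = missing_dependencies()
if missing:
    # A real connector would raise a descriptive exception here rather
    # than proceed with the attach.
    print('cannot attach replicated volume, missing: %s' % ', '.join(missing))
```
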
Testing
=======

In order to properly test this in tempest, programmatic access to the
storage backend will be needed, for example to fail one of the drives of a
replicated volume.

A lighter alternative is to check the connected NVMe subsystems
(`nvme list`) and scan the MDRAID arrays (`mdadm --detail --scan`) to verify
that multiple NVMe devices were connected and a RAID array was created.

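A minimal check of the kind described above could parse `mdadm --detail
--scan` output and confirm an array exists. The sample output line below is
illustrative, with the UUID elided.

```python
def array_devices(mdadm_scan_output):
    """Extract array device paths from `mdadm --detail --scan` output."""
    devices = []
    for line in mdadm_scan_output.splitlines():
        parts = line.split()
        # Each assembled array is reported as: ARRAY <device> <details...>
        if len(parts) >= 2 and parts[0] == 'ARRAY':
            devices.append(parts[1])
    return devices


sample = 'ARRAY /dev/md0 metadata=1.2 name=host:0 UUID=...'
```
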
In either case, tempest will need to be aware that the storage backend is
configured to use replicated NVMe volumes, and only then perform these
checks.

Aside from that, running tempest with an NVMe replicated volume backend will
still fully exercise this functionality; only the specific assertions (NVMe
devices and RAID info) would differ.

Finally, we will start with unit tests, and have functional tests in
os-brick as a stretch goal.

Documentation Impact
====================

Document that the NVMe connector now supports replicated volumes and that
connection information for the replicas is required from the volume driver
to support it.

References
==========

Architectural diagram:
https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png