..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
NVMe Connector Support MDRAID replication
==========================================

https://blueprints.launchpad.net/cinder/+spec/nvme-of-add-client-raid-1

NVMe connector replicated volume support via MDRAID.
Allow OpenStack to use replicated NVMe volumes that are distributed on
scale-out storage.

Problem description
===================

When consuming block storage, resilience of the block volumes is required.
This can be achieved on the storage backend with both NVMe and iSCSI,
though the target then remains a single point of failure; multipathing
provides HA of the paths to a target, but it does not replicate the volume
data itself.

A storage solution exposing high performance NVMeoF storage needs a way to
handle volume replication. We propose achieving this by using RAID 1 on the
host, since the relative performance impact will be less significant in
such an environment and will be greatly outweighed by the benefits.

Use Cases
=========

When resiliency is needed for NVMeoF storage.

Taking advantage of NVMe over a high performance fabric, the consumer gains
the benefits of replication on the host while maintaining good performance.
In this case volume replica failures have no impact on the consumer, and
self-healing can take place seamlessly.

For developers of volume drivers, this allows OpenStack to use replicated
volumes that the drivers expose in their storage backend.

End users operating in this mode benefit from the performance of NVMe with
added resilience due to volume replication.

Proposed change
===============

Expand the NVMeoF connector in os-brick to be able to take in connection
information for replicated volumes.

When `connect_volume` is called with replicated volume information, the
connector connects to all replica targets over NVMe and creates a RAID 1
array over the resulting devices.

.. image:: https://wiki.openstack.org/w/images/a/ab/Nvme-of-add-client-raid1-detail.png
   :width: 448
   :alt: NVMe client RAID diagram

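As a rough sketch (not the actual os-brick code), the per-replica
connection plus the array creation boil down to one `nvme connect` per
replica portal and a single `mdadm --create`. The helper names, the device
paths, and the use of only the first portal are illustrative assumptions:

```python
def nvme_connect_cmd(replica):
    """Build the `nvme connect` command for one replica's first portal."""
    portal = replica['portals'][0]
    return ['nvme', 'connect',
            '-t', portal['transport'],
            '-a', portal['address'],
            '-s', portal['port'],
            '-n', replica['nqn']]


def mdadm_create_cmd(md_name, devices):
    """Build the `mdadm` command that creates RAID 1 over the devices."""
    return ['mdadm', '--create', f'/dev/md/{md_name}',
            '--run', '--level=1',
            f'--raid-devices={len(devices)}'] + devices


replica = {'nqn': 'nqn.2014-08.org.nvmexpress:uuid:example',
           'portals': [{'address': '10.0.0.101', 'port': '4420',
                        'transport': 'tcp'}]}
print(' '.join(nvme_connect_cmd(replica)))
print(' '.join(mdadm_create_cmd('vol0', ['/dev/nvme0n1', '/dev/nvme1n1'])))
```

In practice the connector would run these through its existing privileged
executor rather than invoking the CLIs directly.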
Alternatives
------------

Replication can also be actively maintained on the backend side, which is
how it is commonly done. However, with NVMe on a high performance fabric,
the benefits of handling replication on the consumer host can outweigh the
performance costs.

Adding support for this also increases the range of backends we fully
support, as not all vendors will choose to support replication on the
backend side.

Both modes can be supported without any impact on each other.

Data model impact
-----------------

The `connection_properties` parameter of the NVMe connector's
`connect_volume` and `disconnect_volume` methods can now also hold a
`volume_replicas` list, which contains the information necessary for
connecting to and identifying the NVMe subsystems before doing MDRAID
replication over them.

Each replica dict in the `volume_replicas` list must include the normal
flat connection properties.

If `volume_replicas` has only one replica, it is treated as a
non-replicated volume.

If `volume_replicas` is omitted, the normal flat connection properties for
NVMe are used, as in the existing version of this NVMe connector.

|
non-replicated volume connection properties::
|
|
|
|
{'uuid': '96e25fb4-9f91-4c88-ab59-275fd354777e',
|
|
'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...'
|
|
'portals': [{
|
|
'address': '10.0.0.101',
|
|
'port': '4420',
|
|
'transport': 'tcp'
|
|
}],
|
|
'volume_replicas': None}
|
|
|
|
replicated volume connection properties::
|
|
|
|
{'volume_replicas': [
|
|
{'uuid': '96e25fb4-9f91-4c88-ab59-275fd354777e',
|
|
'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...'
|
|
'portals': [{
|
|
'address': '10.0.0.101',
|
|
'port': '4420',
|
|
'transport': 'tcp'
|
|
}]},
|
|
{'uuid': '12345fb4-9f91-4c88-ab59-275fd3544321',
|
|
'nqn': 'nqn.2014-08.org.nvmexpress:uuid:...'
|
|
'portals': [{
|
|
'address': '10.0.0.110',
|
|
'port': '4420',
|
|
'transport': 'tcp'
|
|
}]},
|
|
]}
|
|
|
|
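A minimal sketch of how the connector could dispatch on `volume_replicas`
(illustrative only; the actual function name and return values in os-brick
are assumptions):

```python
def replication_mode(connection_properties):
    """Decide how to treat the volume based on `volume_replicas`."""
    replicas = connection_properties.get('volume_replicas')
    if not replicas:
        # Key omitted (or None): use the normal flat connection properties.
        return 'flat'
    if len(replicas) == 1:
        # A single replica is treated as a non-replicated volume.
        return 'single-replica'
    # Two or more replicas: connect to each and build RAID 1 over them.
    return 'raid1'


print(replication_mode({'uuid': '96e25fb4-9f91-4c88-ab59-275fd354777e',
                        'volume_replicas': None}))          # flat
print(replication_mode({'volume_replicas': [{'uuid': 'a'},
                                            {'uuid': 'b'}]}))  # raid1
```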
REST API impact
---------------

None

Security impact
---------------

Requires elevated privileges for managing MDRAID.
(The current NVMe connector already executes the `nvme` CLI via sudo, so
this change just adds execution of `mdadm`.)

Active/Active HA impact
-----------------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

Working in this replicated mode allows for special-case scenarios. For
example, an MDRAID array with four replicas loses connection to two of
them and keeps writing data to the two remaining ones. If, after a
re-attach from a reboot or a migration, the host for some reason has
access only to the two originally lost replicas and not to the two "good"
ones, the re-created MDRAID array will have old, bad data.

This can be remedied by storage backend awareness of devices going faulty
in the array. That awareness is enabled by the NVMe monitoring agent,
which can recognize replicas going faulty in an array and notify the
storage backend, which then marks these replicas as faulty for the
replicated volume.

Multi-attach is not supported for NVMe MDRAID volumes.

|
Performance Impact
|
|
------------------
|
|
|
|
Replicated volume attachments will be slower (need to build MDRAID array).
|
|
It's a fair tradeoff, slower attachment for more resiliency.
|
|
|
|
Other deployer impact
---------------------

NVMe and MDRAID and their CLI clients (`nvme` and `mdadm`) need to be
available on the hosts for NVMe connections and RAID replication
respectively.

Developer impact
----------------

Gives storage vendors the option of supporting replicated NVMeoF volumes
via their driver.

To use this feature, volume drivers will need to expose NVMe storage that
is replicated and provide the necessary connection information for it.

Non-replicated volumes are not affected.

Implementation
==============

Assignee(s)
-----------

Zohar Mamedov (zoharm)

Work Items
----------

All done in the NVMe connector:

- In `connect_volume`, parse the connection information for replicated
  volumes.
- Connect to the NVMeoF targets and identify the devices.
- Create an MD RAID 1 array over the devices.
- Return a symlink to the MDRAID device.
- In `disconnect_volume`, destroy the MDRAID array.
- In `extend_volume`, grow the MDRAID array.

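The disconnect and extend work items map onto standard `mdadm` invocations.
A hedged sketch, where the helper names and the device path are examples
and error handling is omitted:

```python
def mdadm_stop_cmd(md_device):
    """Stop (destroy) the running array during disconnect_volume."""
    return ['mdadm', '--stop', md_device]


def mdadm_grow_cmd(md_device):
    """Grow the array to the new size of its members in extend_volume."""
    return ['mdadm', '--grow', md_device, '--size=max']


print(' '.join(mdadm_stop_cmd('/dev/md/vol0')))  # mdadm --stop /dev/md/vol0
print(' '.join(mdadm_grow_cmd('/dev/md/vol0')))
```

Disconnecting would additionally issue an `nvme disconnect` for each
replica target after the array is stopped.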
Dependencies
============

NVMe and MDRAID and their CLI clients (`nvme` and `mdadm`) need to be
available on the hosts for NVMe connections and RAID replication
respectively. The connector should fail gracefully if they are not found.

Testing
=======

In order to properly test this in tempest, programmatic access to the
storage backend will be needed, for example to fail one of the drives of a
replicated volume.

Alternatively, a simpler check of the connected NVMe subsystems
(`nvme list`) and a scan of the MDRAID arrays (`mdadm --detail --scan`)
could verify that multiple NVMe devices were connected and a RAID array
was created.

In either case tempest will need to be aware that the storage backend is
configured to use replicated NVMe volumes, and only then perform these
checks.

Aside from that, running tempest with an NVMe replicated volume backend
will still fully exercise this functionality; only the specific assertions
(NVMe devices and RAID info) would differ.

Finally, we will start with unit tests, and have functional tests in
os-brick as a stretch goal.

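For the simpler check, the tempest-side assertion could parse the
`mdadm --detail --scan` output. A sketch; the sample line is only
illustrative of the tool's usual `ARRAY <device> ...` format, and real
UUID/name values will differ:

```python
def raid_arrays(scan_output):
    """Extract array device paths from `mdadm --detail --scan` output."""
    return [line.split()[1]
            for line in scan_output.splitlines()
            if line.startswith('ARRAY ')]


# Illustrative sample output line.
sample = 'ARRAY /dev/md/vol0 metadata=1.2 name=host:vol0 UUID=1234abcd:5678ef90'
print(raid_arrays(sample))  # ['/dev/md/vol0']
```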
Documentation Impact
====================

Document that the NVMe connector now supports replicated volumes, and that
connection information for the replicas is required from the volume driver
to support this.

|
References
|
|
==========
|
|
|
|
Architectural diagram
|
|
https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png
|