From 6a3ab81771c1166abb552cd8e69460417462ad90 Mon Sep 17 00:00:00 2001
From: Zohar Mamedov
Date: Wed, 27 Oct 2021 17:07:03 +0700
Subject: [PATCH] Os-brick NVMeoF connector multipathing spec

Native multipathing
DeviceMapper "multipathing" / sync replication
Interaction between these and connector RAID feature

Implements: blueprint nvmeof-multipath
Change-Id: I863238f619cb809df7243e6d4317f6ff64c0e883
---
 specs/yoga/nvme-multipath.rst | 359 ++++++++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 specs/yoga/nvme-multipath.rst

diff --git a/specs/yoga/nvme-multipath.rst b/specs/yoga/nvme-multipath.rst
new file mode 100644
index 00000000..9e3b5eda
--- /dev/null
+++ b/specs/yoga/nvme-multipath.rst
@@ -0,0 +1,359 @@
..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

===================
NVMeoF Multipathing
===================

https://blueprints.launchpad.net/cinder/+spec/nvmeof-multipath

Extend the NVMeoF connector to support two types of multipathing:

* Native NVMeoF multipathing
* DeviceMapper based multipathing

At the time of writing, only the TCP and RDMA transport protocols are
supported and covered by this spec; other transport protocols such as FC
are out of scope.


Problem description
===================

Currently the NVMeoF connector only supports single pathed connections.

The problem can be summed up in one statement: a single pathed connection
is a single point of failure.

For resiliency, and to catch up with industry standards, support for
multipathed connections is needed, mitigating this single point of failure.

Another potential benefit is improved performance while maintaining
replication, for example when the backend replicates synchronously and the
multipath mechanism can fail over between paths.

This can be done in either of two ways:

NVMeoF ANA (Asymmetric Namespace Access) is the native NVMeoF way of doing
multipathing, and it offers an interesting set of features.
A spec is available for more under-the-hood details [1]_.

Using DeviceMapper for multipathing is an industry standard approach with
transports such as iSCSI (whereas NVMeoF ANA is not yet as widely
supported), and it can equally be used with NVMeoF connections.
The goal of the DM component of this spec's implementation is to bring the
NVMeoF connector up to speed with using DeviceMapper for multipathed
connections.


Use Cases
=========

Whether via native multipath or DeviceMapper, the use case is the ability to
expose multipathed connections to volumes, mitigating the connection single
point of failure and improving performance.

For example, an NVMeoF volume has two paths (portals) and a multipath
connection is established to it. If one of the paths breaks (such as in a
partial network failure), the attached device remains functional, whereas
with a single pathed connection the device would become unavailable to the
host it was attached to.


Proposed change
===============

The multipathing implementation will be broken up into its two main
component features - Native and DM multipathing:

Phase 1 (Native)
----------------

The NVMeoF connector will check the host's multipath capability and report
it in the connector properties passed to the volume driver; this also allows
any driver or backend specific handling.

The value of `/sys/module/nvme_core/parameters/multipath` can be `Y` or `N` -
this should be returned as a True or False `nvme_native_multipath` property
by the `get_connector_properties` method of the `NVMeOFConnector` class.
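As a minimal sketch of that capability check (the helper name is
hypothetical; the actual implementation may read the parameter through
existing os-brick utilities instead of opening sysfs directly):

.. code-block:: python

    NVME_MP_PARAM = '/sys/module/nvme_core/parameters/multipath'

    def _is_native_multipath_supported():
        # Hypothetical helper: report True when the kernel was built with
        # native NVMe multipathing and the nvme_core parameter is 'Y'.
        try:
            with open(NVME_MP_PARAM) as f:
                return f.read().strip() == 'Y'
        except IOError:
            # Parameter file missing: kernel has no native NVMe multipath.
            return False

The resulting boolean would then be reported as the `nvme_native_multipath`
connector property.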
The "new" NVMeoF connector API (introduced in Wallaby) supports passing
multiple portals for the same NVMeoF device. Currently only the first
successful connection is established.

In multipath mode, the portals connection method should simply connect to
all portals by running the `nvme connect` CLI command for each one.

Under the hood, the NVMe subsystem will create hidden devices for each of
the portal connections and expose a single multipath device on top of them.

It should be noted that the native multipath spec requires the target
portals to expose the device under the same target NQN and namespace ID for
multipath to happen. (This is the backend's responsibility and should be
transparent to OpenStack code.)

Finally, the new device will be matched by its UUID, as is already
implemented in the baseline single path connector.

The connection information for a native multipath volume will look like:

.. code-block:: python

    {
        'vol_uuid': '...someuuid...',
        'target_nqn': '...somenqn...',
        'portals': [
            ('10.1.1.1', 4420, 'tcp'),
            ('10.1.1.2', 4420, 'tcp')
        ]
    }

Regarding `enforce_multipath` - it is a DeviceMapper multipathing parameter
and it will not have any interaction with native NVMe multipathing.

Phase 2 (DeviceMapper)
----------------------

For DeviceMapper multipathing, use `dm_replicas` in the connection
information to specify all the replicas. This approach allows for multiple
NQNs for one synced volume, which is often a requirement for DM-based
multipathing.

`dm_replicas` cannot be used together with `volume_replicas`, which is for
RAID replication only. Drivers should know not to pass both; if both are
passed, the operation should fail.

As in the baseline connector, connect to all `dm_replicas`. (It is worth
mentioning that each replica could have multiple portals, essentially
allowing this DeviceMapper aggregation to run on top of native multipath
devices.)

Once the connections are established, DeviceMapper will handle the
multipathing under the hood (this should be transparent to OpenStack code).

DeviceMapper needs to recognize that two devices should be "merged", and it
does so based on device identity information. It is therefore the backend's
responsibility to expose its devices with identity information that conforms
to DeviceMapper's device matching requirements.
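To make the connection flow concrete, here is a hedged sketch of both phases
(the function names are illustrative, not the actual os-brick
implementation; `nvme connect` and its `-t`/`-a`/`-s`/`-n` flags are
standard nvme-cli):

.. code-block:: python

    import subprocess

    def connect_all_portals(target_nqn, portals):
        # Phase 1 sketch: connect every portal of one subsystem; with
        # native multipathing enabled the kernel exposes a single device
        # on top of the resulting hidden per-portal devices.
        # In practice os-brick would run this with root privileges.
        for address, port, transport in portals:
            subprocess.run(
                ['nvme', 'connect',
                 '-t', transport,
                 '-a', address,
                 '-s', str(port),
                 '-n', target_nqn],
                check=True)

    def connect_dm_replicas(connection_info):
        # Phase 2 sketch: each replica is connected independently (possibly
        # itself natively multipathed); DeviceMapper then merges the
        # resulting devices based on their identity information.
        for replica in connection_info['dm_replicas']:
            connect_all_portals(replica['target_nqn'], replica['portals'])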
The connection information for a DeviceMapper multipath volume will look
like:

.. code-block:: python

    {
        'vol_uuid': '...someuuid...',
        'dm_replicas': [
            {
                'vol_uuid': '...someuuid1...',
                'target_nqn': '...somenqn1...',
                'portals': [('10.1.1.1', 4420, 'tcp')]
            },
            {
                'vol_uuid': '...someuuid2...',
                'target_nqn': '...somenqn2...',
                'portals': [('10.1.1.2', 4420, 'tcp')]
            }
        ]
    }


Connector API Summary
---------------------

With this proposal, there will be multiple replication and/or multipath mode
permutations of operation for the connector:

- Single pathed connection
- Native multipathed connection
- RAID connection (single or multipathed)
- DeviceMapper connection (single or native multipath)

Note: RAID on top of DM is not allowed, as it is inconsistent with the
connection information API (supporting it would needlessly overcomplicate
both the API and the physical implementation). It could theoretically be
added one day based on need, but it is currently considered overkill on
multiple levels.

A sketch of the mode dispatch implied by these permutations follows the
examples below.

Single path connection information for a cinder driver:

.. code-block:: python

    {
        'vol_uuid': '...someuuid...',
        'target_nqn': '...somenqn...',
        'portals': [('10.1.1.1', 4420, 'tcp')]
    }

Connection information for a cinder driver that supports native multipathed
connections (if a single path connection is requested, os-brick will iterate
over the portals until one connects):

.. code-block:: python

    {
        'vol_uuid': '...someuuid...',
        'target_nqn': '...somenqn...',
        'portals': [
            ('10.1.1.1', 4420, 'tcp'),
            ('10.1.1.2', 4420, 'tcp')
        ]
    }

For DeviceMapper, the example below is given with no native multipath for
the replica devices. For these replica devices to also do native
multipathing, the portals for all of each device's paths should be specified
in its `portals` list.

Connection information for DeviceMapper:

.. code-block:: python

    {
        'vol_uuid': '...someuuid...',
        'dm_replicas': [
            {
                'vol_uuid': '...someuuid1...',
                'target_nqn': '...somenqn1...',
                'portals': [('10.1.1.1', 4420, 'tcp')]
            },
            {
                'vol_uuid': '...someuuid2...',
                'target_nqn': '...somenqn2...',
                'portals': [('10.1.1.2', 4420, 'tcp')]
            }
        ]
    }
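A hedged sketch of that dispatch (names are illustrative, not the actual
os-brick implementation; it only encodes the rules stated above):

.. code-block:: python

    def _select_connection_mode(conn_info):
        # dm_replicas (DeviceMapper) and volume_replicas (RAID) are
        # mutually exclusive; RAID on top of DM is not allowed.
        has_dm = 'dm_replicas' in conn_info
        has_raid = 'volume_replicas' in conn_info
        if has_dm and has_raid:
            raise ValueError(
                'dm_replicas and volume_replicas are mutually exclusive')
        if has_dm:
            return 'device_mapper'
        if has_raid:
            return 'raid'
        # Plain connection: single path or native multipath, depending on
        # how many portals were passed and on host capability.
        return 'native'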
Alternatives
============

Both alternatives for NVMeoF multipathing that I am aware of are described
in this spec: Native and DeviceMapper.

Data model impact
=================

None

REST API impact
===============

None

Security impact
===============

Same as the baseline NVMeoF connector:

* Sudo executions of `nvme`
* | Needs access for reading of root filesystem paths such as:
  | `/sys/class/nvme-fabrics/...`
  | `/sys/class/block/...`


Active/Active HA impact
=======================

None

Notifications impact
====================

None

Other end user impact
=====================

None

Performance Impact
==================

The main benefit is increased I/O throughput while maintaining replication -
such as in a synchronous replication active-passive failover multipath
configuration.

One performance cost occurs during attachment on the host, where multiple
NVMeoF devices (rather than just one) now need to be connected, with a new
device exposed on top of them.

Depending on the mode of multipath connection (for example active-active vs
active-passive), there could be an increase in network resource usage for
the multipath device I/O.


Other deployer impact
=====================

None

Developer impact
================

Driver developers that want to use NVMeoF will need to be aware of the
connector API - specifically the connection_information formats described
above.

It is also worth mentioning that storage backends (although transparent to
OpenStack code) will need to keep in mind the system level requirements for
whichever multipath mode they use (Native or DeviceMapper).


Implementation
==============

Assignee(s)
-----------

* Zohar Mamedov (zoharm)
* Anyone Else

Work Items
----------

Phase 1 (Native):

* Connector properties: check whether the kernel supports native multipath.

* Establish all (multiple) necessary connections to the multipath volume and
  ensure the proper device is exposed.


Phase 2 (DeviceMapper):

* Ensure all sync replicated devices are connected.

* Ensure the proper DeviceMapper device is exposed.


Dependencies
============

Baseline dependencies are the same as for the baseline single pathed
connector: the relevant NVMeoF kernel modules, and nvme-cli.

For native NVMe multipathing, the kernel needs to support it.

For DeviceMapper multipathing, DeviceMapper needs to be operational.


Testing
=======

Unit tests.
There is also an upcoming gate job that will use LVM to test NVMeoF, where
multipath testing can be included.


Documentation Impact
====================

Document the new connection_information API described in this spec -
especially so that drivers are able to follow and use it.


References
==========

.. [1] ANA Spec - Section 8.1.3:
   https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf