Merge "Replace OSDs on an AIO-DX System (pick)"

Zuul 2021-11-24 19:35:43 +00:00 committed by Gerrit Code Review
commit 10550afed2
3 changed files with 133 additions and 5 deletions


@@ -119,9 +119,10 @@ Configure Ceph OSDs on a Host
add-ssd-backed-journals-using-horizon
add-ssd-backed-journals-using-the-cli
add-a-storage-tier-using-the-cli
replace-osds-and-journal-disks
provision-storage-on-a-controller-or-storage-host-using-horizon
provision-storage-on-a-storage-host-using-the-cli
replace-osds-and-journal-disks
replace-osds-on-an-aio-dx-system-319b0bc2f7e6
-------------------------
Persistent Volume Support


@@ -13,8 +13,18 @@ You can replace failed storage devices on storage nodes.
For best results, ensure the replacement disk is the same size as others in
the same peer group. Do not substitute a smaller disk than the original.
The replacement disk is automatically formatted and updated with data when the
storage host is unlocked. For more information, see |node-doc|: :ref:`Change
Hardware Components for a Storage Host
<changing-hardware-components-for-a-storage-host>`.
.. note::
Due to a limitation in **udev**, the device path of a disk connected through
a SAS controller changes when the disk is replaced. Therefore, in the
general procedure below, you must lock, delete, and re-install the node.
However, for an |AIO-DX| system, use the following alternative procedure to
replace |OSDs| without reinstalling the host:
:ref:`Replace OSDs on an AIO-DX System <replace-osds-on-an-aio-dx-system-319b0bc2f7e6>`.
.. rubric:: |proc|
Follow the procedure located at |node-doc|: :ref:`Change
Hardware Components for a Storage Host <changing-hardware-components-for-a-storage-host>`.
The replacement disk is automatically formatted and updated with data when the
storage host is unlocked.


@@ -0,0 +1,117 @@
.. _replace-osds-on-an-aio-dx-system-319b0bc2f7e6:
================================
Replace OSDs on an AIO-DX System
================================
On systems that use a Ceph backend for persistent storage, you can replace
storage disks or swap an |AIO-DX| node while the system is running, even if the
storage resources are in active use.
.. note::
All storage alarms must be cleared before starting this procedure (see the check below).
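As an optional check (not part of the original procedure), you can confirm that no storage alarms are outstanding with the fault management CLI:
.. code-block:: none
~(keystone_admin)$ fm alarm-list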
.. rubric:: |context|
You can replace |OSDs| in an |AIO-DX| system to increase capacity or to replace
faulty disks, without reinstalling the host.
.. rubric:: |proc|
#. Ensure that the controller with the |OSD| to be replaced is the standby
controller.
For example, if the disk to be replaced is on controller-1 and controller-1 is
currently the active controller, check its role and then swact activity to
controller-0:
.. code-block:: none
~(keystone_admin)$ system host-show controller-1 | fgrep capabilities
~(keystone_admin)$ system host-swact controller-1
After the swact, reconnect via SSH to the <oam-floating-ip> to reach the newly
active controller-0.
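As an optional check, confirm that controller-0 has taken over the active role before continuing; this mirrors the capabilities query shown above:
.. code-block:: none
~(keystone_admin)$ system host-show controller-0 | fgrep capabilities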
#. Determine the **osdid** of the disk that is to be replaced.
.. code-block:: none
~(keystone_admin)$ system host-stor-list controller-1
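If you also need to identify the physical device backing that |OSD|, you can list the disks on the host and match them against the storage entries above (an illustrative cross-check; column names vary by release):
.. code-block:: none
~(keystone_admin)$ system host-disk-list controller-1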
#. Lock the standby controller, controller-1, so that the changes can be made.
.. code-block:: none
~(keystone_admin)$ system host-lock controller-1
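Optionally, verify that controller-1 now reports an administrative state of locked:
.. code-block:: none
~(keystone_admin)$ system host-list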
#. Run the :command:`ceph osd destroy osd.<id> --yes-i-really-mean-it` command to mark the |OSD| as destroyed.
.. code-block:: none
~(keystone_admin)$ ceph osd destroy osd.<id> --yes-i-really-mean-it
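As an optional verification, :command:`ceph osd tree` is expected to report the |OSD| as destroyed at this point:
.. code-block:: none
~(keystone_admin)$ ceph osd tree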
#. Power down controller-1.
#. Replace the storage disk.
#. Power on controller-1.
#. Unlock controller-1.
.. code-block:: none
~(keystone_admin)$ system host-unlock controller-1
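The host reboots as part of the unlock. Optionally, monitor its state until it reports unlocked, enabled, and available:
.. code-block:: none
~(keystone_admin)$ system host-list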
#. Wait for the recovery process in the Ceph cluster to complete.
.. code-block:: none
~(keystone_admin)$ ceph -s
cluster:
id: 50ce952f-bd16-4864-9487-6c7e959be95e
health: HEALTH_WARN
Degraded data redundancy: 13/50 objects degraded (26.000%), 10 pgs degraded
services:
mon: 1 daemons, quorum controller (age 68m)
mgr: controller-0(active, since 66m)
mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
osd: 2 osds: 2 up (since 9s), 2 in (since 9s)
data:
pools: 3 pools, 192 pgs
objects: 25 objects, 300 MiB
usage: 655 MiB used, 15 GiB / 16 GiB avail
pgs: 13/50 objects degraded (26.000%)
182 active+clean
8 active+recovery_wait+degraded
2 active+recovering+degraded
io:
recovery: 24 B/s, 1 keys/s, 1 objects/s
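Recovery time depends on the amount of data to be rebuilt. A simple way to follow progress is to re-run :command:`ceph -s` periodically, for example:
.. code-block:: none
~(keystone_admin)$ watch -n 10 ceph -s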
#. Ensure that the Ceph cluster is healthy.
.. code-block:: none
~(keystone_admin)$ ceph -s
cluster:
id: 50ce952f-bd16-4864-9487-6c7e959be95e
health: HEALTH_OK
services:
mon: 1 daemons, quorum controller (age 68m)
mgr: controller-0(active, since 66m), standbys: controller-1
mds: kube-cephfs:1 {0=controller-0=up:active} 1 up:standby
osd: 2 osds: 2 up (since 36s), 2 in (since 36s)
data:
pools: 3 pools, 192 pgs
objects: 25 objects, 300 MiB
usage: 815 MiB used, 15 GiB / 16 GiB avail
pgs: 192 active+clean