Docs: Update ceph documentation
- Add a section for Ceph troubleshooting
- Rearrange the Testing section to include Ceph

Co-Authored-By: portdirect <pete@port.direct>
Change-Id: Ib04e9b59fea2557cf6cad177dfcc76390c161e06
Signed-off-by: Pete Birley <pete@port.direct>

@@ -17,7 +17,7 @@ Contents:
    install/index
    readme
    specs/index
-   testing
+   testing/index
    troubleshooting/index
 
 Indices and Tables

doc/source/testing/ceph-resiliency/README.rst (new file, 25 lines)
@@ -0,0 +1,25 @@
========================================
Resiliency Tests for OpenStack-Helm/Ceph
========================================

Mission
=======

The goal of our resiliency tests for `OpenStack-Helm/Ceph
<https://github.com/openstack/openstack-helm/tree/master/ceph>`_ is to
demonstrate the symptoms of software and hardware failures and to provide
solutions.

Caveats:

- Our focus lies on resiliency for various failure scenarios, not on
  performance or stress testing.

Software Failure
================

* `Monitor failure <./monitor-failure.html>`_
* `OSD failure <./osd-failure.html>`_

Hardware Failure
================

* `Disk failure <./disk-failure.html>`_
* `Host failure <./host-failure.html>`_

doc/source/testing/ceph-resiliency/disk-failure.rst (new file, 171 lines)
@@ -0,0 +1,171 @@
============
Disk Failure
============

Test Environment
================

- Cluster size: 4 host machines
- Number of disks: 24 (= 6 disks per host * 4 hosts)
- Kubernetes version: 1.10.5
- Ceph version: 12.2.3
- OpenStack-Helm commit: 25e50a34c66d5db7604746f4d2e12acbdd6c1459

Case: A disk fails
==================

Symptom:
--------

This test covers the scenario where a disk fails. Monitoring the Ceph
status, we notice that one OSD (osd.2) on voyager4, which has ``/dev/sdh``
as its backing disk, is down.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (23 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 23 up, 23 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2548 MB used, 42814 GB / 42816 GB avail
      pgs:     182 active+clean

.. code-block:: console

  (mon-pod):/# ceph osd tree
  ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
  -1       43.67981 root default
  -9       10.91995     host voyager1
   5   hdd  1.81999         osd.5          up  1.00000 1.00000
   6   hdd  1.81999         osd.6          up  1.00000 1.00000
  10   hdd  1.81999         osd.10         up  1.00000 1.00000
  17   hdd  1.81999         osd.17         up  1.00000 1.00000
  19   hdd  1.81999         osd.19         up  1.00000 1.00000
  21   hdd  1.81999         osd.21         up  1.00000 1.00000
  -3       10.91995     host voyager2
   1   hdd  1.81999         osd.1          up  1.00000 1.00000
   4   hdd  1.81999         osd.4          up  1.00000 1.00000
  11   hdd  1.81999         osd.11         up  1.00000 1.00000
  13   hdd  1.81999         osd.13         up  1.00000 1.00000
  16   hdd  1.81999         osd.16         up  1.00000 1.00000
  18   hdd  1.81999         osd.18         up  1.00000 1.00000
  -2       10.91995     host voyager3
   0   hdd  1.81999         osd.0          up  1.00000 1.00000
   3   hdd  1.81999         osd.3          up  1.00000 1.00000
  12   hdd  1.81999         osd.12         up  1.00000 1.00000
  20   hdd  1.81999         osd.20         up  1.00000 1.00000
  22   hdd  1.81999         osd.22         up  1.00000 1.00000
  23   hdd  1.81999         osd.23         up  1.00000 1.00000
  -4       10.91995     host voyager4
   2   hdd  1.81999         osd.2        down        0 1.00000
   7   hdd  1.81999         osd.7          up  1.00000 1.00000
   8   hdd  1.81999         osd.8          up  1.00000 1.00000
   9   hdd  1.81999         osd.9          up  1.00000 1.00000
  14   hdd  1.81999         osd.14         up  1.00000 1.00000
  15   hdd  1.81999         osd.15         up  1.00000 1.00000

Solution:
---------

To replace the failed OSD, execute the following procedure:

1. From the Kubernetes cluster, remove the failed OSD pod, which is running on ``voyager4``:

.. code-block:: console

  $ kubectl label nodes --all ceph_maintenance_window=inactive
  $ kubectl label nodes voyager4 --overwrite ceph_maintenance_window=active
  $ kubectl patch -n ceph ds ceph-osd-default-64779b8c -p='{"spec":{"template":{"spec":{"nodeSelector":{"ceph-osd":"enabled","ceph_maintenance_window":"inactive"}}}}}'

Note: To find the DaemonSet associated with a failed OSD, check the following:

.. code-block:: console

  (voyager4)$ ps -ef | grep /usr/bin/ceph-osd
  (voyager1)$ kubectl get ds -n ceph
  (voyager1)$ kubectl get ds <daemonset-name> -n ceph -o yaml

3. Remove the failed OSD (OSD ID = 2 in this example) from the Ceph cluster:

.. code-block:: console

  (mon-pod):/# ceph osd lost 2
  (mon-pod):/# ceph osd crush remove osd.2
  (mon-pod):/# ceph auth del osd.2
  (mon-pod):/# ceph osd rm 2
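
On Ceph Luminous (12.2.x) and later, the last three removal commands can
usually be collapsed into a single ``ceph osd purge`` call. The following is
an optional shortcut, not part of the procedure validated above:

.. code-block:: console

  # purge combines "crush remove", "auth del" and "osd rm" for osd.2
  (mon-pod):/# ceph osd purge 2 --yes-i-really-mean-it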

4. Verify that Ceph is healthy with the lost OSD removed (i.e., a total of 23 OSDs):

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (23 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 23 osds: 23 up, 23 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2551 MB used, 42814 GB / 42816 GB avail
      pgs:     182 active+clean

5. Replace the failed disk with a new one. If you repair (rather than replace)
   the failed disk, you may need to run the following:

.. code-block:: console

  (voyager4)$ parted /dev/sdh mklabel msdos

6. Start a new OSD pod on ``voyager4``:

.. code-block:: console

  $ kubectl label nodes voyager4 --overwrite ceph_maintenance_window=inactive
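
Before validating the cluster, it can help to confirm that an OSD pod has
actually been rescheduled onto ``voyager4``. This is a minimal check; the pod
names will differ in your environment:

.. code-block:: console

  # list OSD pods scheduled on the repaired host (names are illustrative)
  $ kubectl get pods -n ceph -o wide | grep ceph-osd | grep voyager4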

7. Validate the Ceph status (i.e., one OSD is added, so the total number of OSDs becomes 24):

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (22 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 24 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 240 objects, 3359 bytes
      usage:   2665 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean

doc/source/testing/ceph-resiliency/host-failure.rst (new file, 98 lines)
@@ -0,0 +1,98 @@
============
Host Failure
============

Test Environment
================

- Cluster size: 4 host machines
- Number of disks: 24 (= 6 disks per host * 4 hosts)
- Kubernetes version: 1.10.5
- Ceph version: 12.2.3
- OpenStack-Helm commit: 25e50a34c66d5db7604746f4d2e12acbdd6c1459

Case: One host machine where ceph-mon is running is rebooted
============================================================

Symptom:
--------

After the reboot of node ``voyager3``, the node status changes to ``NotReady``.

.. code-block:: console

  $ kubectl get nodes
  NAME       STATUS     ROLES     AGE       VERSION
  voyager1   Ready      master    6d        v1.10.5
  voyager2   Ready      <none>    6d        v1.10.5
  voyager3   NotReady   <none>    6d        v1.10.5
  voyager4   Ready      <none>    6d        v1.10.5

The Ceph status shows that the ceph-mon running on ``voyager3`` is out of quorum.
Also, the six OSDs running on ``voyager3`` are down, so only 18 of the 24 OSDs are up.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              6 osds down
              1 host (6 osds) down
              Degraded data redundancy: 195/624 objects degraded (31.250%), 8 pgs degraded
              too few PGs per OSD (17 < min 30)
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager2

    services:
      mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 18 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 208 objects, 3359 bytes
      usage:   2630 MB used, 44675 GB / 44678 GB avail
      pgs:     195/624 objects degraded (31.250%)
               126 active+undersized
               48  active+clean
               8   active+undersized+degraded
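
To see exactly which OSDs are affected, the OSD tree can be filtered to the
down ones. This is a minimal sketch assuming a shell in a Ceph mon pod and a
Luminous-or-later release:

.. code-block:: console

  # list only the OSDs currently reported as down
  (mon-pod):/# ceph osd tree down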

Recovery:
---------

The node status of ``voyager3`` changes to ``Ready`` after the node is up again.
Also, the Ceph pods are restarted automatically.
The Ceph status shows that the monitor running on ``voyager3`` is back in quorum.

.. code-block:: console

  $ kubectl get nodes
  NAME       STATUS    ROLES     AGE       VERSION
  voyager1   Ready     master    6d        v1.10.5
  voyager2   Ready     <none>    6d        v1.10.5
  voyager3   Ready     <none>    6d        v1.10.5
  voyager4   Ready     <none>    6d        v1.10.5

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
      health: HEALTH_WARN
              too few PGs per OSD (22 < min 30)
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager1(active), standbys: voyager3
      mds: cephfs-1/1/1 up {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
      osd: 24 osds: 24 up, 24 in
      rgw: 2 daemons active

    data:
      pools:   18 pools, 182 pgs
      objects: 208 objects, 3359 bytes
      usage:   2635 MB used, 44675 GB / 44678 GB avail
      pgs:     182 active+clean

doc/source/testing/ceph-resiliency/index.rst (new file, 12 lines)
@@ -0,0 +1,12 @@
===============
Ceph Resiliency
===============

.. toctree::
   :maxdepth: 2

   README
   monitor-failure
   osd-failure
   disk-failure
   host-failure

doc/source/testing/ceph-resiliency/monitor-failure.rst (new file, 125 lines)
@@ -0,0 +1,125 @@
===============
Monitor Failure
===============

Test Environment
================

- Cluster size: 4 host machines
- Number of disks: 24 (= 6 disks per host * 4 hosts)
- Kubernetes version: 1.9.3
- Ceph version: 12.2.3
- OpenStack-Helm commit: 28734352741bae228a4ea4f40bcacc33764221eb

We have 3 Monitors in this Ceph cluster, one on each of the 3 Monitor
hosts.

Case: 1 out of 3 Monitor Processes is Down
==========================================

This is to test a scenario where 1 out of 3 Monitor processes is down.

To bring down 1 Monitor process (out of 3), we identify a Monitor
process and kill it from the Monitor host (not from inside a pod).

.. code-block:: console

  $ ps -ef | grep ceph-mon
  ceph     16112 16095  1 14:58 ?        00:00:03 /usr/bin/ceph-mon --cluster ceph --setuser ceph --setgroup ceph -d -i voyager2 --mon-data /var/lib/ceph/mon/ceph-voyager2 --public-addr 135.207.240.42:6789
  $ sudo kill -9 16112

In the meantime, we monitored the status of Ceph and noted that it
takes about 24 seconds for the killed Monitor process to recover from
``down`` to ``up``. The reason is that Kubernetes automatically
restarts pods whenever they are killed.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager3

    services:
      mon: 3 daemons, quorum voyager1,voyager3, out of quorum: voyager2
      mgr: voyager4(active)
      osd: 24 osds: 24 up, 24 in

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              mon voyager1 is low on available space
              1/3 mons down, quorum voyager1,voyager2

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager4(active)
      osd: 24 osds: 24 up, 24 in

We also monitored the status of the Monitor pod through ``kubectl get
pods -n ceph``, and the status of the pod (where a Monitor process was
killed) changed as follows: ``Running`` -> ``Error`` -> ``Running``.
This recovery process takes about 24 seconds.
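
To observe these state transitions live, the mon pods can be watched. This is
a minimal sketch that assumes the ``application=ceph`` and ``component=mon``
labels used elsewhere in this guide:

.. code-block:: console

  # stream Monitor pod status changes as they happen
  $ kubectl get pods -n ceph -l application=ceph,component=mon -w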

Case: 2 out of 3 Monitor Processes are Down
===========================================

This is to test a scenario where 2 out of 3 Monitor processes are down.
To bring down 2 Monitor processes (out of 3), we identify two Monitor
processes and kill them from the 2 Monitor hosts (not from inside the pods),
as sketched below.
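
The procedure mirrors the single-Monitor case above; the host names and PIDs
in this sketch are illustrative only:

.. code-block:: console

  # on the first Monitor host (PID is illustrative)
  (voyager2)$ ps -ef | grep /usr/bin/ceph-mon
  (voyager2)$ sudo kill -9 <ceph-mon-pid>

  # on the second Monitor host
  (voyager3)$ ps -ef | grep /usr/bin/ceph-mon
  (voyager3)$ sudo kill -9 <ceph-mon-pid>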

We monitored the status of Ceph while the Monitor processes were killed
and noted that the symptoms are similar to those seen when 1 Monitor
process is killed:

- It takes longer (about 1 minute) for the killed Monitor processes to
  recover from ``down`` to ``up``.

- The status of the pods (where the two Monitor processes were killed)
  changed as follows: ``Running`` -> ``Error`` -> ``CrashLoopBackOff``
  -> ``Running``, and this recovery process takes about 1 minute.


Case: 3 out of 3 Monitor Processes are Down
===========================================

This is to test a scenario where 3 out of 3 Monitor processes are down.
To bring down 3 Monitor processes (out of 3), we identify all 3
Monitor processes and kill them from the 3 Monitor hosts (not from inside the pods).

We monitored the status of the Ceph Monitor pods and noted that the
symptoms are similar to those seen when 1 or 2 Monitor processes are killed:

.. code-block:: console

  $ kubectl get pods -n ceph -o wide | grep ceph-mon
  NAME             READY     STATUS    RESTARTS   AGE
  ceph-mon-8tml7   0/1       Error     4          10d
  ceph-mon-kstf8   0/1       Error     4          10d
  ceph-mon-z4sl9   0/1       Error     7          10d

.. code-block:: console

  $ kubectl get pods -n ceph -o wide | grep ceph-mon
  NAME             READY     STATUS             RESTARTS   AGE
  ceph-mon-8tml7   0/1       CrashLoopBackOff   4          10d
  ceph-mon-kstf8   0/1       Error              4          10d
  ceph-mon-z4sl9   0/1       CrashLoopBackOff   7          10d

.. code-block:: console

  $ kubectl get pods -n ceph -o wide | grep ceph-mon
  NAME             READY     STATUS    RESTARTS   AGE
  ceph-mon-8tml7   1/1       Running   5          10d
  ceph-mon-kstf8   1/1       Running   5          10d
  ceph-mon-z4sl9   1/1       Running   8          10d

The status of the pods (where the three Monitor processes were killed)
changed as follows: ``Running`` -> ``Error`` -> ``CrashLoopBackOff``
-> ``Running``, and this recovery process takes about 1 minute.

doc/source/testing/ceph-resiliency/osd-failure.rst (new file, 107 lines)
@@ -0,0 +1,107 @@
===========
OSD Failure
===========

Test Environment
================

- Cluster size: 4 host machines
- Number of disks: 24 (= 6 disks per host * 4 hosts)
- Kubernetes version: 1.9.3
- Ceph version: 12.2.3
- OpenStack-Helm commit: 28734352741bae228a4ea4f40bcacc33764221eb

Case: OSD processes are killed
==============================

This is to test a scenario where some of the OSDs are down.

To bring down 6 OSDs (out of 24), we identify the OSD processes and
kill them from a storage host (not from inside a pod).

.. code-block:: console

  $ ps -ef | grep /usr/bin/ceph-osd
  ceph     44587 43680  1 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb5 -f -i 4 --setuser ceph --setgroup disk
  ceph     44627 43744  1 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb2 -f -i 6 --setuser ceph --setgroup disk
  ceph     44720 43927  2 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb6 -f -i 3 --setuser ceph --setgroup disk
  ceph     44735 43868  1 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb1 -f -i 9 --setuser ceph --setgroup disk
  ceph     44806 43855  1 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb4 -f -i 0 --setuser ceph --setgroup disk
  ceph     44896 44011  2 18:12 ?        00:00:01 /usr/bin/ceph-osd --cluster ceph --osd-journal /dev/sdb3 -f -i 1 --setuser ceph --setgroup disk
  root     46144 45998  0 18:13 pts/10   00:00:00 grep --color=auto /usr/bin/ceph-osd

  $ sudo kill -9 44587 44627 44720 44735 44806 44896

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              6 osds down
              1 host (6 osds) down
              Reduced data availability: 8 pgs inactive, 58 pgs peering
              Degraded data redundancy: 141/1002 objects degraded (14.072%), 133 pgs degraded
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager4(active)
      osd: 24 osds: 18 up, 24 in

In the meantime, we monitored the status of Ceph and noted that it takes
about 30 seconds for the 6 OSDs to recover from ``down`` to ``up``. The
reason is that Kubernetes automatically restarts OSD pods whenever they
are killed.

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager4(active)
      osd: 24 osds: 24 up, 24 in

Case: An OSD pod is deleted
===========================

This is to test a scenario where an OSD pod is deleted by
``kubectl delete $OSD_POD_NAME`` (see the sketch below).
Meanwhile, we monitored the status of Ceph and noted that it takes about
90 seconds for the OSD running in the deleted pod to recover from
``down`` to ``up``.
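
A minimal way to pick and delete one OSD pod; the pod name below is purely
illustrative and will differ in your deployment:

.. code-block:: console

  # list the OSD pods, then delete one of them (name is illustrative)
  $ kubectl get pods -n ceph -o wide | grep ceph-osd
  $ OSD_POD_NAME=ceph-osd-default-64779b8c-example
  $ kubectl delete pod -n ceph ${OSD_POD_NAME}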

.. code-block:: console

  root@voyager3:/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              1 osds down
              Degraded data redundancy: 43/945 objects degraded (4.550%), 35 pgs degraded, 109 pgs undersized
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager4(active)
      osd: 24 osds: 23 up, 24 in

.. code-block:: console

  (mon-pod):/# ceph -s
    cluster:
      id:     fd366aef-b356-4fe7-9ca5-1c313fe2e324
      health: HEALTH_WARN
              mon voyager1 is low on available space

    services:
      mon: 3 daemons, quorum voyager1,voyager2,voyager3
      mgr: voyager4(active)
      osd: 24 osds: 24 up, 24 in

We also monitored the pod status through ``kubectl get pods -n ceph``
during this process. The deleted OSD pod status changed as follows:
``Terminating`` -> ``Init:1/3`` -> ``Init:2/3`` -> ``Init:3/3`` ->
``Running``, and this process takes about 90 seconds. The reason is
that Kubernetes automatically recreates OSD pods whenever they are
deleted.

@@ -1,9 +1,6 @@
-=======
-Testing
-=======
-
+==========
 Helm Tests
-----------
+==========
 
 Every OpenStack-Helm chart should include any required Helm tests necessary to
 provide a sanity check for the OpenStack service. Information on using the Helm

@@ -27,7 +24,6 @@ chart. If Rally tests are not appropriate or adequate for a service chart, any
 additional tests should be documented appropriately and adhere to the same
 expectations.
 
-
 Running Tests
 -------------
 

doc/source/testing/index.rst (new file, 9 lines)
@@ -0,0 +1,9 @@
=======
Testing
=======

.. toctree::
   :maxdepth: 2

   helm-tests
   ceph-resiliency/index

doc/source/troubleshooting/ceph.rst (new file, 59 lines)
@@ -0,0 +1,59 @@
Backing up a PVC
^^^^^^^^^^^^^^^^

Backing up a PVC stored in Ceph is fairly straightforward. In this example we
use the PVC ``mysql-data-mariadb-server-0``, but the same procedure applies to
any other service using PVCs, e.g. RabbitMQ or Postgres.

.. code-block:: shell

  # Gather the required details
  NS_NAME="openstack"
  PVC_NAME="mysql-data-mariadb-server-0"
  # you can check this by running: kubectl get pvc -n ${NS_NAME}

  PV_NAME="$(kubectl get -n ${NS_NAME} pvc "${PVC_NAME}" --no-headers | awk '{ print $3 }')"
  RBD_NAME="$(kubectl get pv "${PV_NAME}" -o json | jq -r '.spec.rbd.image')"
  MON_POD=$(kubectl get pods \
    --namespace=ceph \
    --selector="application=ceph" \
    --selector="component=mon" \
    --no-headers | awk '{ print $1; exit }')

  # Copy the admin keyring and ceph.conf from a Ceph mon pod to the host node
  kubectl exec -it ${MON_POD} -n ceph -- cat /etc/ceph/ceph.client.admin.keyring > /etc/ceph/ceph.client.admin.keyring
  sudo kubectl get cm -n ceph ceph-etc -o json | jq -j .data[] > /etc/ceph/ceph.conf

  export CEPH_MON_NAME="ceph-mon-discovery.ceph.svc.cluster.local"

  # Create a snapshot of the RBD image
  rbd snap create rbd/${RBD_NAME}@snap1 -m ${CEPH_MON_NAME}
  rbd snap list rbd/${RBD_NAME} -m ${CEPH_MON_NAME}

  # Export the snapshot and compress it; make sure the host has enough
  # space to accommodate the resulting files.

  # a. If the host has enough space, export first and then compress:
  rbd export rbd/${RBD_NAME}@snap1 /backup/${RBD_NAME}.img -m ${CEPH_MON_NAME}
  cd /backup
  time xz -0vk --threads=0 /backup/${RBD_NAME}.img

  # b. If space on the host is limited, export and compress in a single command:
  rbd export rbd/${RBD_NAME}@snap1 -m ${CEPH_MON_NAME} - | xz -0v --threads=0 > /backup/${RBD_NAME}.img.xz
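
Optionally, verify the compressed backup before relying on it; a minimal
sketch, assuming ``xz`` is installed on the host:

.. code-block:: shell

  # test the archive integrity without decompressing it to disk
  xz -t /backup/${RBD_NAME}.img.xz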

Restoring is just as straightforward. Once the workload consuming the device has
been stopped and the raw RBD device removed, the following will import the
backup and create a device:

.. code-block:: shell

  cd /backup
  unxz -k ${RBD_NAME}.img.xz
  rbd import /backup/${RBD_NAME}.img rbd/${RBD_NAME} -m ${CEPH_MON_NAME}
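
If desired, the restored image can be checked before bringing the workload
back; this is a sketch using a standard ``rbd`` command:

.. code-block:: shell

  # confirm the image was recreated with the expected size
  rbd info rbd/${RBD_NAME} -m ${CEPH_MON_NAME}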

Once this has been done, the workload can be restarted.

@@ -11,6 +11,7 @@ Sometimes things go wrong. These guides will help you solve many common issues w
    persistent-storage
    proxy
    ubuntu-hwe-kernel
+   ceph
 
 Getting help
 ============