Merge "[rook-ceph] Document the procedure for migrating Ceph to Rook"

commit 1409a039e0
@@ -12,6 +12,7 @@ issues with the following:
    persistent-storage
    ubuntu-hwe-kernel
    ceph
+   migrate-ceph-to-rook
 
 Getting help
 ============
99  doc/source/troubleshooting/migrate-ceph-to-rook.rst  Normal file

@@ -0,0 +1,99 @@
Migrating Ceph to Rook
^^^^^^^^^^^^^^^^^^^^^^

It may be necessary or desirable to migrate an existing `Ceph`_ cluster that
was originally deployed using the Ceph charts in `openstack-helm-infra`_ so
that it is managed by the Rook operator going forward. This operation is not a
supported `Rook`_ feature, but it is possible to achieve.

The procedure must completely stop and start all of the Ceph pods a few times
during the migration and is therefore disruptive to Ceph operations, but the
result is a Ceph cluster deployed and managed by the Rook operator that keeps
the same cluster FSID as the original with all OSD data preserved.

The steps involved in migrating a legacy openstack-helm Ceph cluster to Rook
are based on the Rook `Troubleshooting`_ documentation and are outlined below.

#. Retrieve the cluster FSID, the name and IP address of an existing ceph-mon
   host that contains a healthy monitor store, and the counts of existing
   ceph-mon and ceph-mgr pods from the existing Ceph cluster, and save this
   information for later.
#. Rename the CephFS pools to follow the naming convention Rook expects.
   CephFS pool names are not customizable with Rook, so new metadata and data
   pools will be created for any filesystems deployed by Rook if the existing
   pools are not named as expected. Pools must be named
   "<filesystem name>-metadata" and "<filesystem name>-data" in order for Rook
   to use existing CephFS pools.
#. Delete the Ceph resources deployed via the openstack-helm-infra charts,
   uninstall the charts, and remove the Ceph node labels.
#. Add the Rook Helm repository and deploy the Rook operator and a minimal
   Ceph cluster using the Rook Helm charts. The cluster will have a new FSID
   and will not include any OSDs, because the OSD disks were all initialized
   by the old Ceph cluster and Rook will recognize this.
#. Save the Ceph monitor keyring, host, and IP address, as well as the IP
   address of the new monitor itself, for later use.
#. Stop the Rook operator by scaling the rook-ceph-operator deployment in the
   Rook operator namespace to zero replicas, then destroy the newly-deployed
   Ceph cluster by deleting all deployments in the Rook Ceph cluster namespace
   except for rook-ceph-tools.
#. Copy the store from the old monitor identified in the first step to the
   new monitor and update its keyring with the key saved from the new monitor.
#. Edit the monmap in the copied monitor store using monmaptool to remove all
   of the old monitors from the monmap and add a single new one using the new
   monitor's IP address.
#. Edit the rook-ceph-mon secret in the Rook Ceph cluster namespace and
   overwrite the base64-encoded FSID with the base64 encoding of the FSID from
   the old Ceph cluster. Make sure the encoding includes the FSID only, with
   no whitespace or newline characters.
#. Edit the rook-config-override configmap in the Rook Ceph cluster namespace
   and set auth_supported, auth_client_required, auth_service_required, and
   auth_cluster_required all to "none" in the global config section to disable
   authentication in the Ceph cluster.
#. Scale the rook-ceph-operator deployment back to one replica to deploy the
   Ceph cluster again. This time the cluster will have the old cluster's FSID,
   and the OSD pods will come up in the new cluster with their data intact.
#. Using ``ceph auth import`` from the rook-ceph-tools pod, import the
   [client.admin] portion of the key saved from the new monitor from its
   initial Rook deployment.
#. Edit the rook-config-override configmap again and remove the settings
   previously added to disable authentication. This will re-enable
   authentication when the Ceph daemons are restarted.
#. Scale the rook-ceph-operator deployment to zero replicas and delete the
   Ceph cluster deployments again to destroy the Ceph cluster one more time.
#. When everything has terminated, immediately scale the rook-ceph-operator
   deployment back to one replica to deploy the Ceph cluster one final time.
#. After the Ceph cluster has been deployed again and all of the daemons are
   running (now with authentication enabled), edit the deployed cephcluster
   resource in the Rook Ceph cluster namespace to set the mon and mgr counts
   back to the original values saved in the first step.

The result is a Ceph cluster, deployed and managed by Rook, with the original
FSID and all of its data intact, and with the same number of monitors and
managers that existed in the previous deployment.
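
As an illustrative sketch only, the pool-rename, scaling, monmap, and
key-import steps above might look like the following commands. The namespace
``rook-ceph``, the pool and filesystem names, the monitor names ``old-mon-*``
and ``a``, the IP address, and the file paths are all assumptions that must be
replaced with the values from your own deployment, and the monitor-store
commands must run where the copied store lives::

    # All names, IPs, and paths below are placeholders for this sketch.
    NS=rook-ceph

    # Rename CephFS pools to the "<filesystem name>-metadata"/"-data"
    # convention so Rook adopts them instead of creating new pools.
    ceph osd pool rename cephfs_metadata myfs-metadata
    ceph osd pool rename cephfs_data myfs-data

    # Stop the Rook operator so it cannot reconcile while the cluster is
    # edited, then destroy the new cluster, keeping only the toolbox.
    kubectl -n "$NS" scale deployment rook-ceph-operator --replicas=0
    kubectl -n "$NS" get deployments -o name \
        | grep -v rook-ceph-tools \
        | xargs -r kubectl -n "$NS" delete

    # Rewrite the monmap in the copied monitor store: drop the old
    # monitors and add the single new one at its IP address.
    ceph-mon -i a --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm old-mon-0 --rm old-mon-1 --rm old-mon-2
    monmaptool /tmp/monmap --add a 10.96.0.10:6789
    ceph-mon -i a --inject-monmap /tmp/monmap

    # Scale the operator back up to redeploy the cluster with the old FSID,
    # then import the [client.admin] key saved from the initial deployment.
    kubectl -n "$NS" scale deployment rook-ceph-operator --replicas=1
    kubectl -n "$NS" exec deploy/rook-ceph-tools -- \
        ceph auth import -i /tmp/admin.keyring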
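
The FSID encoding in the secret-editing step is easy to get wrong: the value
must encode the bare FSID with no trailing newline, which a plain ``echo``
silently appends. A small sketch (the FSID shown is a made-up example)::

    # Example FSID only -- substitute the FSID saved from the old cluster.
    FSID="f2d6d4c1-9f35-4b51-8d2a-33a7c0a3a1b2"

    # 'printf %s' emits the FSID without a trailing newline (plain 'echo'
    # would append one and corrupt the secret value); 'tr -d' strips any
    # line wrapping that base64 may add.
    ENCODED=$(printf '%s' "$FSID" | base64 | tr -d '\n')
    printf '%s\n' "$ENCODED"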

There is a rudimentary `script`_ provided that automates this process. It is
not meant to be a complete solution and is not supported as such; it is simply
an example that is known to work for some test implementations.

The script uses environment variables to specify which Rook and Ceph releases
to deploy, the Kubernetes namespaces in which to deploy the Rook operator and
Ceph cluster, and the paths to YAML files that contain the necessary
definitions for the Rook operator and the Rook Ceph cluster. All of these have
default values and are not required to be set, but it will likely be necessary
at least to define the paths to the YAML files required to deploy the Rook
operator and Ceph cluster. Please refer to the comments near the top of the
script for more information about these environment variables.

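
For example, a hypothetical invocation might look like the following. The
variable names and values here are illustrative placeholders only; the
authoritative names and defaults are documented in the script's own comments::

    # Hypothetical variable names and values -- consult the comments near
    # the top of the script for its real interface before running it.
    export ROOK_RELEASE="v1.13.7"
    export CEPH_RELEASE="v18.2.2"
    export ROOK_OPERATOR_NAMESPACE="rook-ceph"
    export ROOK_CLUSTER_NAMESPACE="rook-ceph"
    export ROOK_OPERATOR_YAML="/tmp/rook-operator-values.yaml"
    export ROOK_CLUSTER_YAML="/tmp/rook-cephcluster-values.yaml"

    ./tools/deployment/ceph/migrate-to-rook-ceph.sh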
The Ceph cluster definition provided to Rook should match the existing Ceph
cluster as closely as possible; otherwise, the migration may not migrate Ceph
cluster resources as expected. The migration of deployed Ceph resources is
unique to each deployment, so sample definitions are not provided here.

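
For orientation only, a minimal CephCluster definition might look like the
following. Every value here (image tag, host path, mon and mgr counts, storage
selection) is an assumption to be matched to the existing cluster, not a
recommendation::

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        # Match the Ceph release already running in the legacy cluster.
        image: quay.io/ceph/ceph:v18.2.2
      dataDirHostPath: /var/lib/rook
      mon:
        # Start minimal; the count is raised back to the saved value in
        # the final step of the procedure.
        count: 1
      mgr:
        count: 1
      storage:
        useAllNodes: true
        useAllDevices: true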
Migrations using this procedure and/or script are very delicate and require
extensive testing before being attempted in production. Even with testing this
is a risky operation, and it should be performed very carefully.

.. _Ceph: https://ceph.io
.. _openstack-helm-infra: https://opendev.org/openstack/openstack-helm-infra
.. _Rook: https://rook.io
.. _Troubleshooting: https://rook.io/docs/rook/latest-release/Troubleshooting/disaster-recovery/#adopt-an-existing-rook-ceph-cluster-into-a-new-kubernetes-cluster
.. _script: https://opendev.org/openstack/openstack-helm-infra/src/tools/deployment/ceph/migrate-to-rook-ceph.sh