[rook-ceph] Document the procedure for migrating Ceph to Rook

This change adds documentation to describe the procedure and
example script for migrating a Ceph cluster deployed with the
legacy openstack-helm-infra Ceph charts to be deployed and managed
by Rook. Rook is the supported mechanism for running Ceph on
Kubernetes and is now the default method for deploying Ceph in
openstack-helm.

Change-Id: I43644295d67ba5efc007bbe1ee194e1dd923677b
This commit is contained in:
Stephen Taylor 2024-03-22 11:59:48 -06:00 committed by Vladimir Kozhukalov
parent b6e6dad141
commit e667ddbb21
2 changed files with 100 additions and 0 deletions

View File

@ -12,6 +12,7 @@ issues with the following:
persistent-storage
ubuntu-hwe-kernel
ceph
migrate-ceph-to-rook
Getting help
============

View File

@ -0,0 +1,99 @@
Migrating Ceph to Rook
^^^^^^^^^^^^^^^^^^^^^^
It may be necessary or desired to migrate an existing `Ceph`_ cluster that was
originally deployed using the Ceph charts in `openstack-helm-infra`_ to be
managed by the Rook operator moving forward. This operation is not a supported
`Rook`_ feature, but it is possible to achieve.
The procedure must completely stop and start all of the Ceph pods a few times
during the migration process and is therefore disruptive to Ceph operations,
but the result is a Ceph cluster deployed and managed by the Rook operator that
maintains the same cluster FSID as the original with all OSD data preserved.
The steps involved in migrating a legacy openstack-helm Ceph cluster to Rook
are based on the Rook `Troubleshooting`_ documentation and are outlined below.
#. Retrieve the cluster FSID, the name and IP address of an existing ceph-mon
host that contains a healthy monitor store, and the numbers of existing
ceph-mon and ceph-mgr pods from the existing Ceph cluster and save
this information for later.
#. Rename CephFS pools so that they use the naming convention expected by Rook.
CephFS pools names are not customizable with Rook, so new metadata and data
pools will be created for any filesystems deployed by Rook if the existing
pools are not named as expected. Pools must be named
"<filesystem name>-metadata" and "<filesystem name>-data" in order for Rook to
use existing CephFS pools.
#. Delete Ceph resources deployed via the openstack-helm-infra charts,
uninstall the charts, and remove Ceph node labels.
#. Add the Rook Helm repository and deploy the Rook operator and a minimal Ceph
cluster using the Rook Helm charts. The cluster will have a new FSID and will
not include any OSDs because the OSD disks are all initialized in the old Ceph
cluster and that will be recognized.
#. Save the Ceph monitor keyring, host, and IP address, as well as the IP
address of the new monitor itself, for later use.
#. Stop the Rook operator by scaling the rook-ceph-operator deployment in the
Rook operator namespace to zero replicas and destroy the newly-deployed Ceph
cluster by deleting all deployments in the Rook Ceph cluster namespace except
for rook-ceph-tools.
#. Copy the store from an old monitor saved in the first step to the new
monitor and update its keyring with the key saved from the new monitor.
#. Edit the monmap in the copied monitor store using monmaptool to remove all
of the old monitors from the monmap and add a single new one using the new
monitor's IP address.
#. Edit the rook-ceph-mon secret in the Rook Ceph cluster namespace and
overwrite the base64-encoded FSID with the base64 encoding of the FSID from the
old Ceph cluster. Make sure the encoding includes the FSID only, no whitespace
or newline characters.
#. Edit the rook-config-override configmap in the Rook Ceph cluster namespace
and set auth_supported, auth_client_required, auth_service_required, and
auth_cluster_required all to none in the global config section to disable
authentication in the Ceph cluster.
#. Scale the rook-ceph-operator deployment back to one replica to deploy the
Ceph cluster again. This time, the cluster will have the old cluster's FSID
and the OSD pods will come up in the new cluster with their data intact.
#. Using 'ceph auth import' from the rook-ceph-tools pod, import the
[client.admin] portion of the key saved from the new monitor from its initial
Rook deployment.
#. Edit the rook-config-override configmap again and remove the settings
previously added to disable authentication. This will re-enable authentication
when Ceph daemons are restarted.
#. Scale the rook-ceph-operator deployment to zero replicas and delete the Ceph
cluster deployments again to destroy the Ceph cluster one more time.
#. When everything has terminated, immediately scale the rook-ceph-operator
deployment back to one to deploy the Ceph cluster one final time.
#. After the Ceph cluster has been deployed again and all of the daemons are
running (with authentication enabled now), edit the deployed cephcluster
resource in the Rook Ceph cluster namespace to set the mon and mgr counts to
their original values saved in the first step.
#. Now you have a Ceph cluster, deployed and managed by Rook, with the original
FSID and all of its data intact, with the same number of monitors and managers
that existed in the previous deployment.
There is a rudimentary `script`_ provided that automates this process. It
isn't meant to be a complete solution and isn't supported as such. It is simply
an example that is known to work for some test implementations.
The script makes use of environment variables to tell it which Rook release and
Ceph release to deploy, in which Kubernetes namespaces to deploy the Rook
operator and Ceph cluster, and paths to YAML files that contain the necessary
definitions for the Rook operator and Rook Ceph cluster. All of these have
default values and are not required to be set, but it will likely be necessary
at least to define paths to the YAML files required to deploy the Rook operator
and Ceph cluster. Please refer to the comments near the top of the script for
more information about utilizing these environment variables.
The Ceph cluster definition provided to Rook should match the existing Ceph
cluster as closely as possible. Otherwise, the migration may not migrate Ceph
cluster resources as expected. The migration of deployed Ceph resources is
unique to each deployment, so sample definitions are not provided here.
Migrations using this procedure and/or script are very delicate and will
require a lot of testing prior to being implemented in production. This is a
risky operation even with testing and should be performed very carefully.
.. _Ceph: https://ceph.io
.. _openstack-helm-infra: https://opendev.org/openstack/openstack-helm-infra
.. _Rook: https://rook.io
.. _Troubleshooting: https://rook.io/docs/rook/latest-release/Troubleshooting/disaster-recovery/#adopt-an-existing-rook-ceph-cluster-into-a-new-kubernetes-cluster
.. _script: https://opendev.org/openstack/openstack-helm-infra/src/tools/deployment/ceph/migrate-to-rook-ceph.sh