Ceph Maintenance

This document provides procedures for maintaining Ceph OSDs.

Check OSD Status

To check the current status of OSDs, execute the following.

utilscli osd-maintenance check_osd_status

OSD Removal

To purge OSDs that are in the down state, execute the following.

utilscli osd-maintenance osd_remove

OSD Removal by OSD ID

To purge a down OSD by its OSD ID, execute the following.

utilscli osd-maintenance remove_osd_by_id --osd-id <OSDID>
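
Because purging is destructive, it may help to check the ID before passing it on. The following is a minimal sketch, assuming OSD IDs are plain non-negative integers; `is_osd_id` is a hypothetical helper name and is not part of `utilscli`.

```shell
#!/bin/sh
# Sketch only: is_osd_id is a hypothetical helper, not part of utilscli.
# Succeeds only if the argument is a non-negative integer such as 0 or 12.
is_osd_id() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;  # empty, or contains a non-digit character
    *) return 0 ;;
  esac
}

# Example use (runs the removal only when the ID looks valid):
#   is_osd_id "$id" && utilscli osd-maintenance remove_osd_by_id --osd-id "$id"
```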

Reweight OSDs

To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute the following.

utilscli osd-maintenance reweight_osds

Replace a Failed OSD

If a drive fails, follow these steps to replace the failed OSD.

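  1. Disable the OSD pod on the host to keep it from being rescheduled. Start by labeling all nodes as outside the maintenance window.
    kubectl label nodes --all ceph_maintenance_window=inactive
  2. Mark the affected node as being in the maintenance window. Below, replace <NODE> with the name of the node where the failed OSD pod exists.
    kubectl label nodes <NODE> --overwrite ceph_maintenance_window=active
  3. Patch the OSD DaemonSet so that it only schedules pods on nodes labeled ceph_maintenance_window=inactive. Because the command patches a DaemonSet (ds), replace <POD_NAME> below with the name of the DaemonSet that manages the failed OSD pod.
    kubectl patch -n ceph ds <POD_NAME> -p='{"spec":{"template":{"spec":{"nodeSelector":{"ceph-osd":"enabled","ceph_maintenance_window":"inactive"}}}}}'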
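
The three commands above can be wrapped in a small function so that the node and DaemonSet names are supplied once. This is only a sketch: `DRY_RUN`, `run`, and `open_maintenance_window` are hypothetical names introduced here. By default the function prints the commands instead of executing them, so the sequence can be reviewed first.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run, and open_maintenance_window are hypothetical names.
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode, print the command (shell quoting is not preserved in the
# printed form); otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: open_maintenance_window <NODE> <DAEMONSET>
open_maintenance_window() {
  node="$1"
  daemonset="$2"
  if [ -z "$node" ] || [ -z "$daemonset" ]; then
    echo "usage: open_maintenance_window <NODE> <DAEMONSET>" >&2
    return 1
  fi
  # nodeSelector taken verbatim from the patch command in step 3.
  selector='{"spec":{"template":{"spec":{"nodeSelector":{"ceph-osd":"enabled","ceph_maintenance_window":"inactive"}}}}}'
  run kubectl label nodes --all ceph_maintenance_window=inactive || return 1
  run kubectl label nodes "$node" --overwrite ceph_maintenance_window=active || return 1
  run kubectl patch -n ceph ds "$daemonset" -p="$selector" || return 1
}
```

Calling `open_maintenance_window <NODE> <DAEMONSET>` prints the three commands; setting `DRY_RUN=0` executes them and stops at the first failure.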

Complete the recovery by executing the following commands from the Ceph utility container.

  1. Identify the ID of the failed OSD by looking for the OSD whose status is down.
    utilscli ceph osd tree
  2. Remove the OSD from the cluster. Below, replace <OSD_ID> with the ID of the failed OSD.
    utilscli osd-maintenance osd_remove_by_id --osd-id <OSD_ID>
  3. Remove the failed drive and replace it with a new one without bringing down the node.
  4. Once the new drive is in place, change the node label back and delete the OSD pod that is in the Error or CrashLoopBackOff state. Below, replace <NODE> with the node name and <POD_NAME> with the failed OSD pod name.
    kubectl label nodes <NODE> --overwrite ceph_maintenance_window=inactive
    kubectl delete pod <POD_NAME> -n ceph

Once the pod is deleted, Kubernetes creates a new pod for the OSD. When the new pod is up, the OSD is added to the Ceph cluster with a weight of 0, so reweight the OSD.

    utilscli osd-maintenance reweight_osds
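
Two points in this procedure rely on reading `ceph osd tree` output: finding the down OSD before removal, and confirming which OSD came back with a weight of 0 before reweighting. The helpers below are a rough sketch for both. `down_osds` and `zero_weight_osds` are hypothetical names, and the sketch assumes the usual plain-text layout, in which the CRUSH weight is the column just before the `osd.N` name and the status is the column just after it.

```shell
#!/bin/sh
# Sketch only: down_osds and zero_weight_osds are hypothetical helper names.
# Both read plain-text `ceph osd tree` output on standard input.

# Print the numeric ID of every OSD whose status column is "down".
down_osds() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
        id = $i; sub(/^osd\./, "", id); print id
      }
  }'
}

# Print the numeric ID of every OSD whose CRUSH weight (the column just
# before the osd.N name) is zero.
zero_weight_osds() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i - 1) + 0 == 0) {
        id = $i; sub(/^osd\./, "", id); print id
      }
  }'
}

# Example use from the Ceph utility container:
#   utilscli ceph osd tree | down_osds
#   utilscli ceph osd tree | zero_weight_osds
```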