
# Ceph Maintenance
This document provides procedures for maintaining Ceph OSDs.
## Check OSD Status
To check the current status of OSDs, execute the following.
```
utilscli osd-maintenance check_osd_status
```
## OSD Removal
To purge OSDs that are in the down state, execute the following.
```
utilscli osd-maintenance osd_remove
```
## OSD Removal by OSD ID
To purge a down OSD by specifying its OSD ID, execute the following.
```
utilscli osd-maintenance remove_osd_by_id --osd-id <OSD_ID>
```
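When several OSDs are down, the IDs can be extracted from `ceph osd tree` output and fed to the command above one at a time. The sketch below is a hypothetical dry run: the sample text stands in for live output from `utilscli ceph osd tree`, and the removal commands are printed rather than executed.

```
# Extract the numeric IDs of OSDs reported "down" by `ceph osd tree`.
down_osd_ids() {
    # On OSD rows, field 4 is the name "osd.N" and field 5 is the status.
    awk '$4 ~ /^osd\./ && $5 == "down" { split($4, a, "."); print a[2] }'
}

# Sample output; in practice, pipe `utilscli ceph osd tree` instead.
osd_tree_sample='ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.29306 root default
 0   hdd 0.09769     osd.0      up     1.00000 1.00000
 1   hdd 0.09769     osd.1      down         0 1.00000
 2   hdd 0.09769     osd.2      up     1.00000 1.00000'

# Print (dry run) the removal command for each down OSD.
echo "$osd_tree_sample" | down_osd_ids | while read -r id; do
    echo "utilscli osd-maintenance remove_osd_by_id --osd-id ${id}"
done
```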
## Reweight OSDs
To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster,
execute the following.
```
utilscli osd-maintenance reweight_osds
```
## Replace a Failed OSD
If a drive fails, follow these steps to replace the failed OSD.
1. Label all nodes with an inactive maintenance window by default; this keeps OSD pods from being rescheduled onto nodes under maintenance.
```
kubectl label nodes --all ceph_maintenance_window=inactive
```
2. Below, replace `<NODE>` with the name of the node hosting the failed OSD pod.
```
kubectl label nodes <NODE> --overwrite ceph_maintenance_window=active
```
3. Below, replace `<DAEMONSET_NAME>` with the name of the DaemonSet that manages
the failed OSD pod. The patch adds a nodeSelector so the DaemonSet only
schedules OSD pods on nodes whose maintenance window is inactive, removing the
pod from the node under maintenance.
```
kubectl patch -n ceph ds <DAEMONSET_NAME> -p='{"spec":{"template":{"spec":{"nodeSelector":{"ceph-osd":"enabled","ceph_maintenance_window":"inactive"}}}}}'
```
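The three steps above can be sketched as a single dry-run helper that prints the commands for review before they are run. `worker-3` and `ceph-osd-default` are hypothetical example names, not values from this document; substitute your node name and the OSD DaemonSet in the `ceph` namespace.

```
# Dry run: print the maintenance-window commands for a node and DaemonSet
# instead of executing them. Arguments: <node> <daemonset>.
maintenance_cmds() {
    node="$1"
    ds="$2"
    echo "kubectl label nodes --all ceph_maintenance_window=inactive"
    echo "kubectl label nodes ${node} --overwrite ceph_maintenance_window=active"
    echo "kubectl patch -n ceph ds ${ds} -p='{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"ceph-osd\":\"enabled\",\"ceph_maintenance_window\":\"inactive\"}}}}}'"
}

maintenance_cmds worker-3 ceph-osd-default
```

Piping the output through `sh` would execute the steps; reviewing it first avoids labeling the wrong node.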
Complete the recovery by executing the following commands from the Ceph utility container.
1. Capture the ID of the failed OSD, which is reported with status `down`.
```
utilscli ceph osd tree
```
2. Remove the OSD from the cluster. Below, replace
`<OSD_ID>` with the ID of the failed OSD.
```
utilscli osd-maintenance remove_osd_by_id --osd-id <OSD_ID>
```
3. Remove the failed drive and replace it with a new one without bringing down
the node.
4. Once the new drive is in place, change the label and delete the OSD pod that
is in the `error` or `CrashLoopBackOff` state. Below, replace `<POD_NAME>`
with the failed OSD pod name.
```
kubectl label nodes <NODE> --overwrite ceph_maintenance_window=inactive
kubectl delete pod <POD_NAME> -n ceph
```
Once the pod is deleted, Kubernetes will schedule a new pod for the OSD.
Once the new pod is up, the OSD joins the Ceph cluster with a CRUSH weight of
`0`. Reweight the OSD.
```
utilscli osd-maintenance reweight_osds
```
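To confirm the reweight took effect, any OSD still carrying a CRUSH weight of `0` can be flagged from `ceph osd tree` output. This is a sketch on sample text standing in for live output from `utilscli ceph osd tree`; `osd.3` is a hypothetical example.

```
# Flag OSDs whose CRUSH weight (field 3 on OSD rows) is still 0.
zero_weight_osds() {
    awk '$4 ~ /^osd\./ && $3 == 0 { print $4 }'
}

# Sample output; in practice, pipe `utilscli ceph osd tree` instead.
sample='ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.19538 root default
 0   hdd 0.09769     osd.0      up     1.00000 1.00000
 3   hdd       0     osd.3      up     1.00000 1.00000'

echo "$sample" | zero_weight_osds
# prints: osd.3
```

An empty result means every OSD has been assigned a non-zero weight.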