# Ceph Maintenance
This MOP covers maintenance activities related to Ceph.
## Table of Contents ##
<!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
- Table of Contents
- 1. Generic Commands
- 2. Replace failed OSD
## 1. Generic Commands ##
### Check OSD Status
To check the current status of OSDs, execute the following:
```
utilscli osd-maintenance check_osd_status
```
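To cross-check against Ceph directly, the standard `ceph` CLI can be invoked through the same `utilscli` wrapper (as is done later in this MOP); for example:
```
utilscli ceph -s
utilscli ceph osd tree
```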
### OSD Removal
To purge OSDs that are in the `down` state, execute the following:
```
utilscli osd-maintenance osd_remove
```
### OSD Removal By OSD ID
To purge a `down` OSD by its OSD ID, execute the following:
```
utilscli osd-maintenance remove_osd_by_id --osd-id <OSDID>
```
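The OSD ID is the numeric value in the `ID` column of `utilscli ceph osd tree` (used in section 2 below). A filled-in invocation would look like this, with an illustrative OSD ID:
```
utilscli osd-maintenance remove_osd_by_id --osd-id 2
```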
### Reweight OSDs
To adjust OSD CRUSH weights in the CRUSH map of a running cluster, execute the following:
```
utilscli osd-maintenance reweight_osds
```
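If a single OSD needs an explicit CRUSH weight instead of the bulk re-weight above, the standard Ceph command can be issued through the same wrapper; this is a sketch, and the weight value is illustrative (by convention it matches the drive capacity in TiB):
```
utilscli ceph osd crush reweight osd.<OSD_ID> 1.81940
```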
## 2. Replace failed OSD ##
In the case of a failed drive, follow the procedure below.
To keep the failed OSD pod from being rescheduled on its host, first label all nodes as outside the maintenance window:
```
kubectl label nodes --all ceph_maintenance_window=inactive
```
Then mark the node hosting the failed OSD as under maintenance. Replace `<NODE>` with the name of the node where the failed OSD pod exists:
```
kubectl label nodes <NODE> --overwrite ceph_maintenance_window=active
```
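Optionally, verify that the labels were applied as expected:
```
kubectl get nodes -L ceph_maintenance_window
```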
Patch the OSD daemonset so that its pods are only scheduled on nodes whose maintenance window is `inactive`, which stops the OSD pod from being scheduled back onto the node under maintenance. Replace `<POD_NAME>` with the name of the daemonset that manages the failed OSD pod:
```
kubectl patch -n ceph ds <POD_NAME> -p='{"spec":{"template":{"spec":{"nodeSelector":{"ceph-osd":"enabled","ceph_maintenance_window":"inactive"}}}}}'
```
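If the failed OSD pod's name is not yet known, it can be looked up by node; this assumes the OSD pods run in the `ceph` namespace, as the other commands in this MOP do:
```
kubectl get pods -n ceph -o wide | grep <NODE>
```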
The following commands should be run from the utility container.
Capture the failed OSD ID by checking for the OSD whose status is `down`:
```
utilscli ceph osd tree
```
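The output will look roughly like the following (IDs, hosts, and weights below are illustrative); the failed OSD reports `down` in the `STATUS` column:
```
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
-1       0.09758 root default
-3       0.04879     host node1
 0   hdd 0.04879         osd.0        up  1.00000 1.00000
-5       0.04879     host node2
 1   hdd 0.04879         osd.1      down        0 1.00000
```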
Remove the OSD from the cluster. Replace `<OSD_ID>` with the failed OSD ID captured above:
```
utilscli osd-maintenance osd_remove_by_id --osd-id <OSD_ID>
```
Remove the failed drive and replace it with a new one without bringing down the node.
Once the new drive is in place, change the label back and delete the affected OSD pod, which will be in the `Error` or `CrashLoopBackOff` state. Replace `<POD_NAME>` with the failed OSD pod's name:
```
kubectl label nodes <NODE> --overwrite ceph_maintenance_window=inactive
kubectl delete pod <POD_NAME> -n ceph
```
Once the pod is deleted, Kubernetes will re-spin a new pod for the OSD. Once the pod is up, the OSD is added to the Ceph cluster with a weight of `0`, so it needs to be re-weighted:
```
utilscli osd-maintenance reweight_osds
```
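After re-weighting, confirm the OSD is back in the cluster with the expected weight using the status commands already shown above:
```
utilscli osd-maintenance check_osd_status
utilscli ceph osd tree
```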