655ab05b71
On AIO-DX, the Ceph monitor was not started after an uncontrolled swact caused by a sudden power off or reboot of the active controller, which broke system high availability.

This happened because a flag recorded which controller ran the last active Ceph monitor. The flag prevented the monitor from starting before drbd-cephmon data was in sync, which could corrupt Ceph data. The same flag also prevented the data corruption that occurred when the mgmt network was down and both controllers became active, starting the Ceph monitor without drbd-cephmon in sync.

To prevent data corruption while maintaining system high availability, this fix checks the mgmt network carrier instead of managing flags. If no carrier is detected on the mgmt network interface, ceph mon and osd are stopped and are only allowed to start again once the mgmt network has carrier.

For AIO-DX Direct, all networks are also checked. If none of the networks has carrier, the other controller is considered down, which lets the working controller stay in the active state even if the mgmt network has no carrier.

Test-Plan:
PASS: Run system host-swact on AIO-DX and verify that ceph is running with status HEALTH_OK.
PASS: Force an uncontrolled swact on AIO-DX by killing a critical process and verify that ceph is running with status HEALTH_OK.
PASS: Disconnect the OAM and MGMT networks for both controllers on AIO-DX and verify that ceph mon and osd stop on both controllers. Reconnect the OAM and MGMT networks and verify that ceph is running with status HEALTH_OK.
PASS: Reboot or power off the active controller and verify on the other controller that ceph is running with status HEALTH_WARN because one host is down. Power on the controller and wait until it is online/available. Verify that ceph reports HEALTH_OK after all OSDs are up and data is recovered.

Closes-bug: 2020889
Signed-off-by: Felipe Sanches Zanoni <Felipe.SanchesZanoni@windriver.com>
Change-Id: I38470f43eba86f88fb9cfe47869d2393cacbd365
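
The carrier rule described in this commit message can be sketched in Python as below. This is an illustration only, not the code changed by this commit: the function names `has_carrier` and `ceph_may_run`, the network labels, and the `direct` parameter are hypothetical. The only external fact assumed is that Linux exposes link state in `/sys/class/net/<iface>/carrier`.

```python
from pathlib import Path
from typing import Mapping


def has_carrier(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the sysfs carrier file for `iface` reads '1'."""
    try:
        text = (Path(sysfs_root) / iface / "carrier").read_text()
    except OSError:
        # Missing interface, or the read failed (for example because the
        # interface is administratively down): treat as no carrier.
        return False
    return text.strip() == "1"


def ceph_may_run(carriers: Mapping[str, bool], direct: bool,
                 mgmt: str = "mgmt") -> bool:
    """Decide whether ceph mon and osd may run (hypothetical helper).

    carriers: carrier state per network, keyed by an illustrative label.
    direct:   True for an AIO-DX Direct configuration.
    """
    if carriers.get(mgmt, False):
        # mgmt has carrier: ceph is allowed to run.
        return True
    if direct and not any(carriers.values()):
        # AIO-DX Direct with no carrier on any network: the other
        # controller is considered down, so this one keeps running.
        return True
    # mgmt has no carrier and the peer may still be alive: stop ceph.
    return False
```

For example, `ceph_may_run({"mgmt": False, "oam": True}, direct=True)` evaluates to `False`, because the peer may still be reachable over another network while mgmt is down.
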
||
---|---|---|
.. | ||
centos | ||
debian | ||
files |