Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The second reboot
occurs because the uptime reported in the first mtcAlive message
following the first reboot exceeds 10 minutes.
Maintenance has a long-standing Graceful Recovery threshold of
10 minutes, meaning that if a host loses heartbeat and enters
Graceful Recovery, and the uptime value extracted from the first
mtcAlive message following that host's recovery exceeds 10
minutes, then maintenance interprets that the host did not reboot.
If a host goes absent for longer than this threshold then, for
reasons not limited to security, maintenance declares the host
'failed' and force re-enables it through a reboot.
With the introduction of containers and the addition of new
features over the last few releases, boot times on some servers
are approaching the 10 minute threshold and, in this case,
exceeded it.
The primary fix in this update is to increase this long-standing
threshold to 15 minutes to account for the evolution of the product.
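The following is a minimal sketch of the uptime check described
above. It is illustrative only; the constant and function names
are hypothetical and do not reflect the actual mtce implementation.

    #include <ctime>

    // Hypothetical threshold: raised from 10 to 15 minutes.
    static const time_t GRACEFUL_RECOVERY_UPTIME_THRESHOLD = 15 * 60 ;

    // Decide whether a host that re-established contact during
    // Graceful Recovery actually rebooted, based on the uptime it
    // reported in its first mtcAlive message.
    bool host_rebooted ( time_t reported_uptime_secs )
    {
        // An uptime below the threshold implies the host rebooted.
        // Above it, maintenance assumes the host never rebooted and
        // fails it, forcing a full enable with reboot.
        return ( reported_uptime_secs < GRACEFUL_RECOVERY_UPTIME_THRESHOLD ) ;
    }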
While debugging this issue, a few other undesirable behaviors
related to Graceful Recovery were observed, and the following
additional changes were implemented.
- Remove hbsAgent process restart in ha service management
failover failure recovery handling. This change is in the
ha git with a loose dependency placed on this update.
Reason: https://review.opendev.org/c/starlingx/ha/+/788299
- Prevent the hbsAgent from sending heartbeat clear events
to maintenance in response to a heartbeat stop command.
Reason: Maintenance receiving these clear events while in
Graceful Recovery causes it to pop out of Graceful
Recovery only to re-enter as a retry, needlessly
consuming one of a maximum of 5 retries.
- Prevent successful Graceful Recovery until all heartbeat
monitored networks recover (see the sketch after this list).
Reason: If heartbeat on one network (say cluster) recovers but
another (management) does not, it is possible for the max
Graceful Recovery retries to be reached quite quickly,
causing maintenance to fail the host and force a full
enable with reboot.
- Extend the graceful recovery handler's wait for the hbsClient
ready event from a 1 minute timeout to the worker config timeout.
Reason: To give the worker config time to complete before force
starting the recovery handler's heartbeat soak.
- Add Graceful Recovery Wait state recovery over process restart.
Reason: Avoid double reboot of Gracefully Recovering host over
SM service bounce.
- Add requirement for a valid out-of-band mtce flags value before
declaring configuration error in the subfunction enable handler.
Reason: Rebooting the active controller can sometimes result in
a falsely reported configuration error due to the
subfunction enable handler interpreting a zero flags
value as a configuration error.
- Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
Reason: To assist log analysis and issue debugging.
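The following is a minimal sketch of the 'all heartbeat monitored
networks must recover' rule referenced in the list above. The
types and names are hypothetical and do not reflect the actual
mtcAgent/hbsAgent code.

    #include <vector>

    // Hypothetical per-network heartbeat state.
    struct monitored_network
    {
        bool monitored ;           // network is heartbeat monitored
        bool heartbeat_recovered ; // heartbeat has resumed on this network
    };

    // Graceful Recovery is only declared successful once heartbeat
    // has recovered on every monitored network.
    bool graceful_recovery_complete ( const std::vector<monitored_network> & networks )
    {
        for ( const auto & net : networks )
        {
            // A single still-failing monitored network (e.g. management
            // while cluster has recovered) keeps the host in recovery.
            if ( net.monitored && ! net.heartbeat_recovered )
                return false ;
        }
        return true ;
    }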
Test Plan:
PASS: Verify handling of active controller reboot
cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
cases: with and without timeout, with and without bmc
cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot with takeover
by a standby controller that has been up for less than 15 minutes.
Regression:
PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
controller' and worker nodes in AIO Plus.
PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing
Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>