The current Graceful Recovery handler does not properly handle
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA:

 phase 1: waiting for the number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: a Graceful Recovery Wait period, which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.
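
The two phases can be pictured as a small state machine. The sketch
below is illustrative only and is not the mtcAgent implementation:
MnfaPhase and next_phase are hypothetical names, and the transition
back to phase 1 on a nested event is an assumption. Only the 11 second
soak and the comparison against mnfa_threshold come from the
description above.

```python
from enum import Enum

# 11 second heartbeat soak, as stated in the description above.
GRACEFUL_RECOVERY_SOAK_SECS = 11


class MnfaPhase(Enum):          # hypothetical names, for illustration only
    WAIT_BELOW_THRESHOLD = 1    # phase 1
    RECOVERY_SOAK = 2           # phase 2 (Graceful Recovery Wait)
    COMPLETE = 3


def next_phase(phase, failed_nodes, mnfa_threshold, stable_secs):
    """Return the phase that follows `phase` for the given observations."""
    if phase is MnfaPhase.WAIT_BELOW_THRESHOLD:
        # Phase 1 ends once the failed node count falls below the threshold.
        if failed_nodes < mnfa_threshold:
            return MnfaPhase.RECOVERY_SOAK
        return phase
    if phase is MnfaPhase.RECOVERY_SOAK:
        # Assumption: a new MNFA event during the soak (nesting) sends
        # handling back to phase 1.
        if failed_nodes >= mnfa_threshold:
            return MnfaPhase.WAIT_BELOW_THRESHOLD
        # Phase 2 ends once the heartbeat has been stable for the full soak.
        if stable_secs >= GRACEFUL_RECOVERY_SOAK_SECS:
            return MnfaPhase.COMPLETE
        return phase
    return phase
```

For example, with a threshold of 2, a soak that has been stable for
5 seconds stays in phase 2, while a second burst of failures during
that soak drops back to phase 1.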

The Graceful Recovery Wait status has been seen to be left
uncleared (stuck) on one or more of the affected nodes if
phase 2 of MNFA is interrupted by another MNFA event, aka
MNFA Nesting.

Although this stuck status is not service affecting, it does leave
the host.task field of one or more nodes, as observed under
host-show, showing "Graceful Recovery Wait" rather than empty.

This update changes Multi Node Failure Avoidance (MNFA) handling
to ensure that, upon MNFA exit, the recovery handler is properly
restarted if MNFA Nesting occurs.
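
A toy model of the stuck task field and of the restart on MNFA exit
is sketched below. It is not the mtcAgent code: the Node class, its
methods and on_mnfa_exit are hypothetical, and the model assumes that
an interrupted handler never reaches the step that clears the task
text.

```python
class Node:
    """Hypothetical stand-in for a host record; not the mtcAgent type."""

    def __init__(self, name):
        self.name = name
        self.task = ""                  # what host-show would display
        self.recovery_running = False

    def start_recovery(self):
        self.task = "Graceful Recovery Wait"
        self.recovery_running = True

    def interrupt_recovery(self):
        # A nested MNFA event stops the handler; the task text stays set.
        self.recovery_running = False

    def finish_recovery(self):
        # Only a running handler reaches the end and clears the task.
        if self.recovery_running:
            self.task = ""
            self.recovery_running = False


def on_mnfa_exit(nodes):
    """Restart recovery for any node whose handler was interrupted."""
    for node in nodes:
        if node.task and not node.recovery_running:
            node.start_recovery()
```

In this model, a node that is interrupted and never restarted keeps
its "Graceful Recovery Wait" text, while calling on_mnfa_exit lets the
restarted handler run to completion and clear it.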

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful Recovery handling time in half
    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds, thereby cutting the potential for
      nesting in half.
 2. Increased supported Graceful Recovery nesting from 3 to 5
    - Found that some links bounce more than others, so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.
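
The arithmetic behind both changes can be sketched as follows. The
helper names are hypothetical and the comparison used for the nesting
limit is an assumption; the values 11, 22, 3 and 5 and the "full
enable" fallback come from this commit message.

```python
HEARTBEAT_SOAK_SECS = 11      # soak length from the description above
MAX_RECOVERY_NESTS = 5        # raised from 3 by this update


def recovery_time_secs(soak_count):
    """Time spent soaking; the handler previously soaked twice."""
    return soak_count * HEARTBEAT_SOAK_SECS


def recovery_action(nest_count, max_nests=MAX_RECOVERY_NESTS):
    """Hypothetical helper: pick the action for a given nesting depth."""
    # Assumption: nesting up to and including the limit is recovered
    # gracefully; anything beyond it falls back to a full enable.
    if nest_count <= max_nests:
        return "graceful recovery"
    return "full enable"
```

With two soaks the handler spent 22 seconds in recovery; with one it
spends 11. Under the same model, 4 nests would have exceeded the old
limit of 3 but fit within the new limit of 5.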

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
 ... nested case that does not time out
 ... case that does not time out
 ... case that does time out

Closes-Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>