
Closes-Bug: 1806963

In the case where the active controller experiences a spontaneous
reboot failure, there is the potential for a race condition in the
new Active-Active Heartbeat model between the inactive hbsAgent and
the mtcAgent starting up on the newly active controller.

The inactive hbsAgent can report a heartbeat Loss before SM starts
up the mtcAgent. This results in a failure to detect a heartbeat
failed host.

This update modifies the hbsAgent to continue to report heartbeat
Loss at a throttled rate while the hbsAgent continues to experience
heartbeat loss of enabled monitored hosts. This change is implemented
in nodeClass.cpp.

Debug of this issue also revealed another undesirable race condition
and logging issue when a controller is locked. This issue is remedied
with the introduction of a control structure 'locked' state that is
set on controller lock and checked in the hbs_cluster_update utility.
hbsCluster.cpp

Two additional hbsAgent logging changes were implemented with this
update.

1. Only print "missing peer controller cluster view" on a state
   change event. Otherwise, this becomes excessive whenever the
   inactive controller fails. hbsAgent.cpp

2. Don't print the full heartbeat inventory and state banner with
   hbsInv.print_node_info on every heartbeat Loss event. Otherwise,
   this becomes excessive in larger systems. hbsCluster.cpp

Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify Lock inactive controller several times.
PASS: Fail inactive controller several times. Verify detect.
PASS: Reboot active controller several times. Verify detect.
PASS: DOR System several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detect.

Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
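
The throttled heartbeat Loss reporting described in the commit message can be illustrated with a minimal sketch. This is not the code from nodeClass.cpp. The type ThrottledLossReporter, its on_audit method, and the interval parameter are hypothetical names chosen for illustration, and the sketch assumes the agent evaluates each monitored host once per audit pass. The idea is that a Loss is reported immediately when first seen and then repeated every N passes while the loss persists, so a late-starting mtcAgent still receives it.

```cpp
#include <cassert>

// Hypothetical helper (not from nodeClass.cpp): decides when a heartbeat
// Loss event for one host should be sent or re-sent.
struct ThrottledLossReporter {
    int interval;   // assumed: re-report once every `interval` audit passes
    int count = 0;  // audit passes seen since the current loss began

    explicit ThrottledLossReporter(int interval_) : interval(interval_) {
        assert(interval > 0);  // guards the modulo below
    }

    // Call once per audit pass for a host.
    // Returns true when a Loss event should be reported on this pass.
    bool on_audit(bool host_enabled, bool heartbeat_lost) {
        if (!host_enabled || !heartbeat_lost) {
            count = 0;  // loss cleared or host not monitored: reset throttle
            return false;
        }
        // The first pass of a loss reports immediately (count == 0),
        // then again every `interval` passes while the loss persists.
        bool report = (count % interval == 0);
        ++count;
        return report;
    }
};
```

With an interval of 3, a host that stays in loss is reported on passes 1, 4, 7 and so on, and the throttle resets as soon as heartbeat recovers or the host is no longer enabled.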
stx-metal
StarlingX Bare Metal Management