metal

History

build_srpm.data	Make Mtce ignore heartbeat events from in-active controller.	2019-05-09 14:42:01 +00:00
mtce.spec	Merge "Refactor infrastructure network in mtce code"	2019-04-23 21:12:41 +00:00

Eric MacDonald 5c043f7ca9 Make Mtce ignore heartbeat events from in-active controller.

There is the potential for a race condition that can lead to
mtce incorrectly failing hosts due to heartbeat failure event
messages sourced from the in-active controller.

During a split brain recovery action scenario there was a swact
which left the hbsAgent on the new stand-by controller thinking
it was still on the active controller.

This specific split brain failure mode was one where the active
and then (after swact) stand-by controller was failing heartbeat
to its peer and other nodes in the system even though the new
active controller saw heartbeat working fine.

The problem was that the in-active controller detected a heartbeat
loss and sent the loss message to mtce before mtce was able to update
the in-active controller's heartbeat activity status, which would
have gated the loss event send.

This update adds an additional layer of protection by intentionally
ignoring heartbeat events from the in-active controller that might
slip through due to this activity state change race condition.

Also fixed a flooding log in the hbsAgent for big systems.

Change-Id: I825a801166b3e80cbf67945c7f587851f4e0d90b
Closes-Bug: 1813976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

2019-05-09 14:42:01 +00:00
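
The protection described in the commit message above can be illustrated
with a small sketch. This is not the mtce code (which is not shown on
this page); it is a toy model under stated assumptions. The class names,
field names, and host names (controller-0, controller-1, compute-3) are
all hypothetical and chosen only for illustration. The idea is that the
receiver checks which controller sent a heartbeat loss event and drops
it when the sender is not the currently active controller, regardless
of what the sender believes its own activity state to be.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatEvent:
    """A heartbeat loss report (hypothetical shape, not the real mtce message)."""
    reporter: str  # controller whose heartbeat agent sent the event
    target: str    # host reported as failing heartbeat


@dataclass
class Maintenance:
    """Toy stand-in for mtce that tracks which controller is active."""
    active_controller: str
    failed_hosts: set = field(default_factory=set)
    ignored_events: list = field(default_factory=list)

    def swact(self, new_active: str) -> None:
        """Switch activity to the other controller."""
        self.active_controller = new_active

    def handle_heartbeat_loss(self, event: HeartbeatEvent) -> bool:
        """Act on the event only if the active controller reported it."""
        if event.reporter != self.active_controller:
            # Assumed guard: the sender may still think it is active
            # (the race window), so the receiver filters on its own view.
            self.ignored_events.append(event)
            return False
        self.failed_hosts.add(event.target)
        return True


# After a swact, a late event from the old active controller is dropped.
mtce = Maintenance(active_controller="controller-0")
mtce.swact("controller-1")
stale = HeartbeatEvent(reporter="controller-0", target="compute-3")
accepted = mtce.handle_heartbeat_loss(stale)
```

Filtering on the receiving side means the check still holds when the
sender's activity status has not yet been updated, which is the race
window the commit message describes.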
