
This update adds alarm handling to the recently introduced pxeboot
network mtcAlive messaging; see the first Depends-On review below.

A new 200.003 maintenance alarm is introduced by the second
Depends-On update below. This new alarm is MINOR but also Management
Affecting because the pxeboot network is required for node
installation.

This update enhances the new pxeboot_mtcAlive_monitor FSM to detect
pxeboot mtcAlive message loss, raise the alarm, and then clear the
alarm once pxeboot mtcAlive messaging resumes.

The new alarm assertion and clear are debounced:
 - alarm is asserted once message loss accumulates to 12 missed
   messages or after 2 minutes of complete message loss.
 - alarm is cleared once the missed message counter is decremented
   to zero or after 1 minute of loss-less messaging.

Upgrades are supported with the addition of a features list to the
mtcClient ready event. All new mtcClients that support pxeboot network
messaging now publish pxeboot mtcAlive support through this new
features list. This is rendered in the logs like this:

<hostname> mtcClient ready ; with pxeboot mtcAlive support

The mtcAgent does not expect/monitor pxeboot mtcAlive messages from
hosts that don't publish the feature support.

Test Plan:

PASS: Verify mtcAlive period is 5 seconds.
PASS: Verify pxeboot mtcAlive monitor period is 10 seconds.
PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every
      mtcAlive monitor miss.
PASS: Verify pxeboot mtcAlive alarm is not raised while a node is
      locked.

Alarm attributes:
PASS: Verify severity is minor.
PASS: Verify alarm is cleared while node is locked.
PASS: Verify alarm can be suppressed while unlocked.
PASS: Verify asserted alarm is management affecting.
PASS: Verify alarm-show output format including cause and repair
      action text.

Process Restart Handling:
PASS: Verify alarm is maintained over a mtcAgent process restart.
PASS: Verify pxeboot monitoring resumes with or without asserted
      alarm immediately following a mtcAgent process restart.
PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging
      immediately following mtcClient process restart for locked or
      unlocked nodes.

Alarm Debounce Handling:
PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss.
PASS: Verify alarm clear after 1 minute of mtcAlive recovery.
PASS: Verify assertion and recovery debounce logging.
PASS: Verify alarm management miss and loss controls handle all
      boundary conditions exercised by a 12 hr soak with randomized
      period between message loss and recovery.

Host Action Handling:
PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable.
PASS: Verify mtcAlive alarm is not raised over a Host Graceful
      Recovery.
PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On.
PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset.
PASS: Verify mtcAlive alarm is not raised over a Host Reinstall.
PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling.
PASS: Verify pxeboot alarm handling for node that does not send
      pxeboot mtcAlive after unlock.

Stuck Alarm Avoidance Handling:
PASS: Verify typical alarm assertion and clear handling.
PASS: Verify alarm is maintained or cleared over node reboot if the
      messaging issue persists or resolves over the reboot recovery.
PASS: Verify mtcAlive alarm is maintained over a Swact and cleared if
      the messaging is ok on the newly active controller.
PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled
      Swact due to active controller reboot.
PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot
      messaging recovers over that reboot.

Upgrades Case:
PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients
      that actually support pxeboot network mtcAlive monitoring.
PASS: Verify mtcClient new features list parsing, which enables
      pxeboot mtcAlive monitoring for that node.
PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled
      towards nodes whose mtcClient does not publish pxeboot mtcAlive
      messaging feature support.
PROG: Verify AIO DX upgrade from 22.12 to current master branch.
      Focus on pxeboot messaging over the upgrade process.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654
Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660

Story: 2010940
Task: 49542
Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
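
Illustrative note on the alarm debounce described in this commit
message: the sketch below is one possible reading of the stated
thresholds (12 accumulated misses or 2 minutes of complete loss to
assert; miss counter back at zero or 1 minute of loss-less messaging to
clear), evaluated once per 10 second monitor period. The class name,
member names, and the choice to reset the miss counter when the alarm
clears are assumptions made for illustration; they are not taken from
the mtcAgent source.

```cpp
#include <algorithm>
#include <cassert>

// Thresholds quoted from the commit message and test plan.
constexpr int kPeriodSecs     = 10;   // pxeboot mtcAlive monitor period
constexpr int kMissThreshold  = 12;   // accumulated misses before assert
constexpr int kLossAssertSecs = 120;  // continuous loss before assert
constexpr int kOkClearSecs    = 60;   // continuous loss-less time before clear

// Hypothetical debounce helper; not the actual pxeboot_mtcAlive_monitor FSM.
class PxebootAliveDebounce {
public:
    // Call once per monitor period. Returns the alarm state afterwards.
    bool on_period(bool message_received) {
        if (message_received) {
            // A good period decrements the miss counter and ends any loss run.
            miss_count_ = std::max(0, miss_count_ - 1);
            loss_secs_  = 0;
            ok_secs_    = std::min(kOkClearSecs, ok_secs_ + kPeriodSecs);
            if (alarmed_ && (miss_count_ == 0 || ok_secs_ >= kOkClearSecs)) {
                alarmed_    = false;
                miss_count_ = 0;  // assumption: start fresh after a clear
            }
        } else {
            // A missed period increments the miss counter and ends any ok run.
            miss_count_ = std::min(kMissThreshold, miss_count_ + 1);
            ok_secs_    = 0;
            loss_secs_  = std::min(kLossAssertSecs, loss_secs_ + kPeriodSecs);
            if (!alarmed_ && (miss_count_ >= kMissThreshold ||
                              loss_secs_ >= kLossAssertSecs)) {
                alarmed_ = true;
            }
        }
        assert(miss_count_ >= 0 && miss_count_ <= kMissThreshold);
        return alarmed_;
    }

    bool alarmed() const { return alarmed_; }
    int  miss_count() const { return miss_count_; }

private:
    bool alarmed_    = false;
    int  miss_count_ = 0;  // accumulated misses, decremented on good periods
    int  loss_secs_  = 0;  // length of the current uninterrupted loss run
    int  ok_secs_    = 0;  // length of the current uninterrupted good run
};
```

In this sketch a caller would invoke on_period() from its periodic
monitor handler and raise or clear the alarm whenever the returned
state changes. Keeping both an accumulated counter and a run timer
lets intermittent loss and complete loss each reach the threshold.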