StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 1335bc484d Add auto run goenabled and start hosts services to mtcClient
The 'mtcClient' currently automatically runs the main function's
'goenabled' scripts on process startup for all nodes if and when
their run preconditions are met.

However, that is not true for 'start host services' and, in the AIO
system type case, the subfunction 'goenabled' scripts.

Typically, this is acceptable because the 'mtcAgent' will request
these scripts to be run during unlock and failure recovery scenarios.

However, if the system administrator reconfigures the maintenance
heartbeat fault handling action from the default 'fail' to any other
setting [degrade,alarm,none] and a node reboots outside of maintenance
control, then upon reboot recovery, the 'start host services' and,
if the node is an AIO controller, the required subfunction 'goenabled'
scripts are not executed. In such a case, the missing subfunction
'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent
and hbsClient on that node from entering their in-service mode of
operation. Instead, they keep waiting for the node's In-Test phase to
complete, which never happens.

This can lead to what appear to be stuck maintenance heartbeat alarms.
However, the real cause is that the maintenance heartbeat processes on
that node are gated from performing their mission mode function.
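
The pre-change behavior described above can be summarized with a small
model. This is only an illustrative sketch, not code from mtcAgent or
mtcClient; the function name and its boolean parameter are hypothetical,
and the model assumes that an uncontrolled reboot leads to maintenance
failure recovery only when the action is the default 'fail'.

```python
# Illustrative model only; names here are hypothetical, not mtce APIs.
HEARTBEAT_FAILURE_ACTIONS = ("fail", "degrade", "alarm", "none")


def agent_requests_scripts(action: str, under_maintenance_control: bool) -> bool:
    """Return True if, before this change, mtcAgent would be expected to
    request 'start host services' and the subfunction 'goenabled' scripts."""
    if action not in HEARTBEAT_FAILURE_ACTIONS:
        raise ValueError(f"unknown heartbeat failure action: {action}")
    if under_maintenance_control:
        # Unlock and failure recovery are driven by mtcAgent.
        return True
    # Uncontrolled reboot: assumed to trigger failure recovery only
    # with the default 'fail' action.
    return action == "fail"


for action in HEARTBEAT_FAILURE_ACTIONS:
    print(action, agent_requests_scripts(action, under_maintenance_control=False))
```

In this model, the three non-default actions are the cases where nothing
asks the node to run the scripts after a spontaneous reboot.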

The /var/run/goenabled_subf flag file is the AIO In-Test complete gate.
It is set if the subfunction 'goenabled' tests pass. However, because
this flag file is in /var/run (a volatile directory) it is lost/cleared
over a reboot.
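
A minimal sketch of a flag-file gate of this kind follows. It is not
the hbsAgent or hbsClient implementation; the function names are
hypothetical, and a temporary directory stands in for /var/run so the
example can run anywhere. Removing the file simulates the volatile
directory being cleared by a reboot.

```python
import os
import tempfile


def in_test_complete(flag_path: str) -> bool:
    """Gate check: the In-Test phase counts as complete only if the flag exists."""
    return os.path.exists(flag_path)


def demo() -> tuple:
    # Temporary directory used as a stand-in for /var/run (assumption).
    with tempfile.TemporaryDirectory() as run_dir:
        flag = os.path.join(run_dir, "goenabled_subf")
        before = in_test_complete(flag)

        # Subfunction 'goenabled' tests pass: the flag file is created.
        open(flag, "w").close()
        after_pass = in_test_complete(flag)

        # Simulated reboot: the volatile directory content is lost.
        os.remove(flag)
        after_reboot = in_test_complete(flag)
    return before, after_pass, after_reboot


print(demo())
```

After the simulated reboot the gate is closed again until something
reruns the tests that create the flag.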

This update adds the automatic execution of the AIO controller's
subfunction 'goenabled' scripts and the 'start host services' for
all nodes. Once all the required preconditions are met the scripts
are run and that node is ready for service, regardless of how and
under what conditions it rebooted.
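
The startup sequence this update describes could be sketched roughly as
follows. This is a simplified illustration under stated assumptions, not
the mtcClient code: NodeState and auto_run_on_startup are hypothetical
names, all preconditions are collapsed into a single boolean, and script
failure handling is left out. The ordering (start host services after
goenabled completion) follows the regression item in the test plan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeState:
    """Hypothetical, simplified view of a node at mtcClient startup."""
    is_aio_controller: bool
    preconditions_met: bool
    actions: List[str] = field(default_factory=list)


def auto_run_on_startup(node: NodeState) -> List[str]:
    """Record which script groups would be run automatically."""
    if not node.preconditions_met:
        # Nothing runs until the required preconditions are met.
        return node.actions
    node.actions.append("main goenabled")
    if node.is_aio_controller:
        # Only AIO controllers have subfunction 'goenabled' scripts.
        node.actions.append("subfunction goenabled")
    # Start host services follows goenabled completion, on all nodes.
    node.actions.append("start host services")
    return node.actions


print(auto_run_on_startup(NodeState(is_aio_controller=True, preconditions_met=True)))
print(auto_run_on_startup(NodeState(is_aio_controller=False, preconditions_met=True)))
```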

Testing of this update is focused on:
- Verifying the originating issue is resolved.
- Verifying the changed behavior over the install of all system types.
- Verifying the changed behavior with an uncontrolled reboot of each
  node type for all the supported maintenance heartbeat failure
  action modes.

Test Plan:

PASS: Verify install of the following system types
PASS: - AIO SX
PASS: - AIO DX and AIO DX Plus
PASS: - Standard DX with worker and storage nodes (vbox)
PASS: - System Controller with 1 subcloud (dc-libvirt)

PASS: Verify spontaneous reboot of unlocked active AIO controller with
PASS: - heartbeat_failure_action=fail
PASS: - heartbeat_failure_action=degrade
PASS: - heartbeat_failure_action=alarm
PASS: - heartbeat_failure_action=none

PASS: Verify spontaneous reboot of unlocked standby AIO controller with
PASS: - heartbeat_failure_action=fail
PASS: - heartbeat_failure_action=degrade
PASS: - heartbeat_failure_action=alarm
PASS: - heartbeat_failure_action=none

PASS: Verify reboot recovery after spontaneous reboot of worker
PASS: Verify reboot recovery after spontaneous reboot of storage
PASS: Verify start host services is run on mtcClient process startup.
PASS: Verify start host services is run on worker and storage nodes
      when rebooted with all heartbeat failure recovery action modes.

Regression:

PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=fail
PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=degrade
PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=alarm
PASS: Verify no alarm or degrade over in-service heartbeat
      failure when heartbeat_failure_action=none
PASS: Verify mtcClient over AIO standby controller lock/unlock
PASS: Verify start host services is run on mtcClient on every node
      by command from mtcAgent process startup.
PASS: Verify start host services is run on mtcClient over an unlock or
      graceful recovery by command from mtcAgent.
PASS: Verify start host services check follows goenabled test
      completion on process startup.
PASS: Verify stop host services is run over a node lock.
PASS: Verify goenabled main and subfunction failure handling
PASS: Verify start host services failure handling
PASS: Verify no coredump or crashdumps
PASS: Verify no stuck alarms

Closes-Bug: 2067917
Change-Id: Ie8aaf5da20b092267f637ad3df125019c244991b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-06-04 19:42:54 +00:00
api-ref/source Fix reference to LLDP neighbors API path in the documentation 2024-02-22 16:32:47 -03:00
bsp-files Upversion to 24.09 2024-05-15 15:29:41 -03:00
devstack Security: Handle nospectre_v1 in the bootargs 2020-01-28 18:21:13 -05:00
doc Fix tox-docs failing sphinx 2023-08-29 16:50:22 -04:00
installer Fix kickstarts patching 2023-10-11 14:40:38 +00:00
kickstart Enable cloud-init services based on boot paremeter 2024-05-07 09:16:17 -04:00
mtce Add auto run goenabled and start hosts services to mtcClient 2024-06-04 19:42:54 +00:00
mtce-common Prevent process coredump due to missing token in response header 2024-04-29 13:11:26 +00:00
mtce-compute Remove qemu dependency from mtce-compute and mtce-control 2023-12-04 14:19:28 +00:00
mtce-control Add ipsec auth server pmon configuration 2024-02-09 16:05:18 -03:00
mtce-storage Update mtce debian package ver based on git 2023-03-02 14:50:35 +00:00
releasenotes Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
tools Set longer shutdown time and fix power state error log 2023-10-05 17:12:19 -04:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Fix github mirroring for this repo 2023-04-28 12:38:51 -04:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-15 19:19:45 +08:00
centos_iso_image.inc Remove unused inventory and python-inventoryclient 2020-01-08 14:12:05 -06:00
centos_pkg_dirs rvmc: remove un-used build data 2020-01-16 08:39:54 -08:00
centos_stable_docker_images.inc Utility to install a server via Redfish 2019-12-31 15:34:54 +00:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
debian_build_layer.cfg Add debian_build_layer.cfg file 2021-10-05 14:08:23 -04:00
debian_iso_image.inc Debian: metal: update debian_iso_image.inc 2022-11-16 12:06:51 +08:00
debian_pkg_dirs Include upgrades meta files to Debian ISO 2022-08-02 21:01:58 +00:00
debian_stable_docker_images.inc debian: port rvmc docker image to Debian 2022-08-12 16:30:01 +00:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pylint.rc Add pylint py3 portability checks for the metal repo 2021-09-13 11:57:42 -03:00
README.rst starlingx/metal README improvement 2023-07-19 12:32:13 -03:00
test-requirements.txt Removed wait_for_worker_config_init in AIO systems 2021-07-08 18:48:28 -04:00
tox.ini Update tox.ini to work with tox 4 2022-12-26 23:26:54 +00:00

metal

The starlingx/metal repository handles StarlingX Bare Metal Management [1].

This repository is not intended to be developed standalone, but rather as part of the StarlingX Source System, which is defined by the StarlingX manifest [2].

References


  1. https://docs.starlingx.io/api-ref/metal

  2. https://opendev.org/starlingx/manifest.git