From 535c28b67da1b0d67e308e07ea642a02208cd02f Mon Sep 17 00:00:00 2001
From: Dmitry Tantsur
Date: Mon, 27 Sep 2021 11:24:04 +0200
Subject: [PATCH] Document recovery from power faults

Change-Id: I95dbbbf0f2cb7e75d3f1c872ffccad99365df321
---
 doc/source/admin/power-sync.rst      | 30 ++++++++++++++++++++++++++++
 doc/source/admin/troubleshooting.rst |  7 ++-----
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/doc/source/admin/power-sync.rst b/doc/source/admin/power-sync.rst
index f4d10aa3c9..47d7c9459b 100644
--- a/doc/source/admin/power-sync.rst
+++ b/doc/source/admin/power-sync.rst
@@ -89,3 +89,33 @@ compute service.
   power state change event is received from the baremetal service in which
   case the power state from compute service's database will be forced on the
   node.
+
+.. _power-fault:
+
+Power fault and recovery
+========================
+
+When `Baremetal Power Sync`_ is enabled, and the Bare Metal service loses
+access to a node (usually because of invalid credentials, BMC issues or
+networking interruptions), the node enters ``maintenance`` mode and its
+``fault`` field is set to ``power failure``. The exact reason is stored in the
+``maintenance_reason`` field.
+
+As always with maintenance mode, only a subset of operations will work on such
+nodes, and both the Compute service and the Ironic's native allocation API will
+refuse to pick them. Any in-progress operations will either pause or fail.
+
+The conductor responsible for the node will try to recover the connection
+periodically (with the interval configured by the
+:oslo.config:option:`conductor.power_failure_recovery_interval` option). If the
+power sync is successful, the ``fault`` field is unset and the node leaves the
+maintenance mode.
+
+.. note::
+    This only applies to automatic maintenance mode with the ``fault`` field
+    set. Maintenance mode set manually is never left automatically.
+
+Alternatively, you can disable maintenance mode yourself once the problem is
+resolved::
+
+    baremetal node maintenance unset
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index c7347cbd8f..0cfff289be 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -33,11 +33,8 @@ A few things should be checked in this case:
     baremetal node provide
 
   The Bare metal service automatically puts a node in maintenance mode if
-  there are issues with accessing its management interface. Check the power
-  credentials (e.g. ``ipmi_address``, ``ipmi_username`` and ``ipmi_password``)
-  and then move the node out of maintenance mode::
-
-    baremetal node maintenance unset
+  there are issues with accessing its management interface. See
+  :ref:`power-fault` for details.
 
   The ``node validate`` command can be used to verify that all required fields
   are present. The following command should not return anything::