Merge "Docs: Troubleshooting: how to exit clean failed"

This commit is contained in:
Zuul 2023-01-26 13:25:02 +00:00 committed by Gerrit Code Review
commit 324bb00cb1


@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
configuration interface to delete volumes/disks, and recreate them. This may
have the same effect as a clean disk, however that too is RAID controller
dependent behavior.
I'm in "clean failed" state, what do I do?
==========================================

There is only one way to exit the ``clean failed`` state. But before we cover
**how**, we need to stress the importance of attempting to understand **why**
cleaning failed. The cause may be as simple as a DHCP failure, or as complex
as a cleaning action failing against the underlying hardware, possibly due to
a hardware fault.

As such, we encourage everyone to determine **why** before exiting the
``clean failed`` state, because you could otherwise make things worse for
yourself. For example, if firmware updates were being performed, you may need
to perform a rollback operation against the physical server, depending on
what firmware was being updated and how. Unfortunately, this is territory
where there is no simple answer.

On the other hand, the failure can be as transient as a networking issue
where a DHCP address was never obtained. This would be suggested by the
``last_error`` field containing a message along the lines of "Timeout
reached while cleaning the node". In any case, we recommend following several
basic troubleshooting steps:

* Consult the ``last_error`` field on the node, using the
  ``baremetal node show <uuid>`` command (see the example session after
  this list).
* If the version of ironic supports the feature, consult the node history
  log with ``baremetal node history list`` and
  ``baremetal node history get <uuid>``.
* Consult the actual console screen of the physical machine. *If* the ramdisk
  booted, you will generally want to investigate the controller logs and see
  if an uploaded agent log is being stored on the conductor responsible for
  the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
  If the node did not boot for some reason, you can typically just retry
  at this point and move on.
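
A minimal example session for the first two steps might look like the
following sketch. The node placeholder and the field list are illustrative,
and the exact arguments of the history commands may vary with your client
version:

.. code-block:: console

   # Check the provision state and the last recorded error for the node.
   baremetal node show <uuid> --fields provision_state last_error

   # If the node history feature is available, list the recorded events and
   # then inspect a specific event in detail.
   baremetal node history list <uuid>
   baremetal node history get <uuid> <event_uuid>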

Once you have understood **why** you reached the state in the first place,
the way out is the ``baremetal node manage <node_id>`` command. This returns
the node to the ``manageable`` state, from which you can retry cleaning,
either automated cleaning with the ``provide`` command or manual cleaning
with the ``clean`` command, or take the next appropriate action in the
workflow you are following, which may ultimately be decommissioning the node
because it has failed and is being removed or replaced.
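
As a sketch of such a recovery sequence (the node identifier and the
``my-clean-steps.json`` file are hypothetical, and manual cleaning requires
you to supply your own clean steps):

.. code-block:: console

   # Move the node from "clean failed" back to "manageable".
   baremetal node manage <node_id>

   # Either retry automated cleaning, returning the node to "available"...
   baremetal node provide <node_id>

   # ...or run manual cleaning with an explicit list of clean steps.
   baremetal node clean <node_id> --clean-steps my-clean-steps.json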