Merge "Docs: Troubleshooting: how to exit clean failed"
commit 324bb00cb1
@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
configuration interface to delete volumes/disks, and recreate them. This may
have the same effect as a clean disk; however, that too is RAID controller
dependent behavior.

I'm in "clean failed" state, what do I do?
==========================================

There is only one way to exit the ``clean failed`` state, but before we get
to the answer as to **how**, we need to stress the importance of attempting
to understand **why** cleaning failed. At the simple end of the spectrum,
the cause may be something as minor as a DHCP failure; at the complex end,
a cleaning action may have failed against the underlying hardware, possibly
due to a hardware failure.

As such, we encourage everyone to attempt to understand **why** before exiting
the ``clean failed`` state, because you could potentially make things worse
for yourself. For example, if firmware updates were being performed, you may
need to perform a rollback operation against the physical server, depending
on what firmware was being updated and how. Unfortunately, this also borders
on the territory of "no simple answer".

On the other hand, the failure can be as transient as a networking issue
where a DHCP address was not obtained. Such a case would typically be
suggested by the ``last_error`` field indicating something like "Timeout
reached while cleaning the node". Either way, we recommend following several
basic troubleshooting steps (example commands follow the list):

* Consult the ``last_error`` field on the node, utilizing the
  ``baremetal node show <uuid>`` command.
* If the version of ironic supports the feature, consult the node history
  log, ``baremetal node history list`` and
  ``baremetal node history get <uuid>``.
* Consult the actual console screen of the physical machine. *If* the ramdisk
  booted, you will generally want to investigate the controller logs and see
  if an uploaded agent log is being stored on the conductor responsible for
  the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
  If the node did not boot for some reason, you can typically just retry
  at this point and move on.

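As a rough sketch of the first two steps, assuming the standalone
``baremetal`` CLI is available and ``<node uuid>`` and ``<event uuid>`` are
placeholders for your node and a node history event:

.. code-block:: console

   # Check the current provision state and the last recorded error.
   baremetal node show <node uuid> --fields provision_state last_error

   # If node history is supported, list the recorded events and fetch one.
   baremetal node history list <node uuid>
   baremetal node history get <node uuid> <event uuid>
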
The way to get out of the state, once you have understood **why** you reached
it in the first place, is to use the ``baremetal node manage <node_id>``
command. This returns the node to the ``manageable`` state, from which you
can retry cleaning, either automated cleaning with the ``provide`` command or
manual cleaning with the ``clean`` command, or take the next appropriate
action in the workflow you are attempting to follow, which may ultimately be
decommissioning the node because it has failed and is being removed or
replaced.

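For example, assuming the ``baremetal`` CLI and a hypothetical
``my-clean-steps.json`` file containing the desired clean steps, exiting the
state might look like the following:

.. code-block:: console

   # Move the node from "clean failed" back to "manageable".
   baremetal node manage <node_id>

   # Then either retry automated cleaning, returning the node to "available" ...
   baremetal node provide <node_id>

   # ... or run manual cleaning with an explicit set of clean steps.
   baremetal node clean <node_id> --clean-steps my-clean-steps.json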