Merge "Actualize ironic-related troubleshooting bits"
commit
ae105d9a60
@ -1,27 +1,56 @@
|
||||
Troubleshooting Node Management Failures
|
||||
----------------------------------------
|
||||
========================================
|
||||
|
||||
Where Are the Logs?
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
-------------------
|
||||
|
||||
Some logs are stored in *journald*, but most are stored as text files in
|
||||
``/var/log``. Ironic and ironic-inspector logs are stored in journald. Note
|
||||
that Ironic has 2 units: ``openstack-ironic-api`` and
|
||||
``openstack-ironic-conductor``. Similarly, ironic-inspector has
|
||||
``openstack-ironic-inspector`` and ``openstack-ironic-inspector-dnsmasq``. So
|
||||
for example to get all ironic-inspector logs use::
|
||||
``/var/log``. They are only accessible by the root user.
|
||||
|
||||
sudo journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq
|
||||
ironic-inspector
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
If something fails during the introspection ramdisk run, ironic-inspector
|
||||
stores the ramdisk logs in ``/var/log/ironic-inspector/ramdisk/`` as
|
||||
gz-compressed tar files. File names contain date, time and IPMI address of the
|
||||
node if it was detected (only for bare metal).
|
||||
The introspection logs (from ironic-inspector) are located in
|
||||
``/var/log/ironic-inspector``. If something fails during the introspection
|
||||
ramdisk run, ironic-inspector stores the ramdisk logs in
|
||||
``/var/log/ironic-inspector/ramdisk/`` as gz-compressed tar files.
|
||||
File names contain the date, time and IPMI address of the node if it was detected
|
||||
(only for bare metal).
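Since the stored files are gzip-compressed tar archives, they can be unpacked with plain ``tar``. The following is a minimal sketch, not part of ironic-inspector: the helper name ``unpack_ramdisk_logs`` is hypothetical, and it assumes ``mktemp`` and ``tar`` are available on the undercloud.

```shell
# Hypothetical helper (not shipped with ironic-inspector): unpack one
# ramdisk log archive into a fresh temporary directory and print its path.
unpack_ramdisk_logs() {
    archive="$1"
    # Create an empty scratch directory so extracted files do not mix
    dest="$(mktemp -d)" || return 1
    # The stored files are gzip-compressed tar archives
    tar -xzf "$archive" -C "$dest" || return 1
    echo "$dest"
}
```

Because the logs are readable only by root, the helper has to run in a root shell, for example ``unpack_ramdisk_logs /var/log/ironic-inspector/ramdisk/<ARCHIVE NAME>``.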
|
||||
|
||||
To collect introspection logs on success as well, set
|
||||
``always_store_ramdisk_logs = true`` in
|
||||
``/etc/ironic-inspector/inspector.conf``, restart the
|
||||
``openstack-ironic-inspector`` service and retry the introspection.
|
||||
|
||||
.. _ironic_logs:
|
||||
|
||||
ironic
|
||||
~~~~~~
|
||||
|
||||
The deployment logs (from ironic) are located in ``/var/log/ironic``. If
|
||||
something goes wrong during deployment or cleaning, the ramdisk logs are
|
||||
stored in ``/var/log/ironic/deploy``. See `ironic logs retrieving documentation
|
||||
<https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#retrieving-logs-from-the-deploy-ramdisk>`_
|
||||
for more details.
|
||||
|
||||
.. _node_registration_problems:
|
||||
|
||||
Node Registration Problems
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Node Registration and Management Problems
|
||||
-----------------------------------------
|
||||
|
||||
Nodes in enroll state after registration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If you see your nodes staying in the ``enroll`` provision state after the
|
||||
registration process (which may hang due to this), it means that Ironic is
|
||||
unable to verify power management credentials, and you need to fix them.
|
||||
Check the ``pm_addr``, ``pm_user`` and ``pm_password`` fields in your
|
||||
``instackenv.json``. In some cases (e.g. when using
|
||||
:doc:`../environments/virtualbmc`) you also need a correct ``pm_port``.
|
||||
Update the node as explained in `Fixing invalid node information`_.
|
||||
|
||||
Fixing invalid node information
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Any problems with node data registered into Ironic can be fixed using the
|
||||
Ironic CLI.
|
||||
@ -31,16 +60,16 @@ For example, a wrong MAC can be fixed in two steps:
|
||||
* Find out the assigned port UUID by running
|
||||
::
|
||||
|
||||
ironic node-port-list <NODE UUID>
|
||||
openstack baremetal port list --node <NODE UUID>
|
||||
|
||||
* Update the MAC address by running
|
||||
::
|
||||
|
||||
ironic port-update <PORT UUID> replace address=<NEW MAC>
|
||||
openstack baremetal port set --address=<NEW MAC> <PORT UUID>
|
||||
|
||||
A wrong IPMI address can be fixed with the following command::
|
||||
|
||||
ironic node-update <NODE UUID> replace driver_info/ipmi_address=<NEW IPMI ADDRESS>
|
||||
openstack baremetal node set <NODE UUID> --driver-info ipmi_address=<NEW IPMI ADDRESS>
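Before updating a port, it can help to sanity-check the new MAC address, so that a typo does not replace one wrong value with another. The sketch below is a hypothetical helper (``is_valid_mac`` is not part of the Ironic CLI) and assumes the usual colon-separated, six-octet notation:

```shell
# Hypothetical helper: succeed only for MAC addresses written as six
# colon-separated hexadecimal octets, e.g. 52:54:00:12:34:56.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}
```

It could be used as a guard, for example ``is_valid_mac "<NEW MAC>" && openstack baremetal port set --address=<NEW MAC> <PORT UUID>``.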
|
||||
|
||||
Node power state is not enforced by Ironic
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@ -58,21 +87,58 @@ Also, note that if ``openstack undercloud install`` is re-run the value of
|
||||
the ``force_power_state_during_sync`` configuration option will be set back to
|
||||
the default, which is ``False``.
|
||||
|
||||
How do I repair broken nodes
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Usually, the nodes should only be deleted when the hardware is decommissioned.
|
||||
Before that, you're expected to remove instances from them using scale-down.
|
||||
However, in some cases, it may be impossible to repair a node with e.g. broken
|
||||
power management, and it gets stuck in an abnormal state.
|
||||
|
||||
.. warning::

   Before proceeding with this section, always try to decommission a node
   normally, by scaling down your cloud. Forcing node deletion may cause
   unpredictable results.
|
||||
|
||||
Ironic requires that nodes that cannot be operated normally are put in the
|
||||
maintenance mode. This is done with the following command::
|
||||
|
||||
openstack baremetal node maintenance set <NODE UUID> --reason="<EXPLANATION>"
|
||||
|
||||
Ironic will stop checking power and health state for such nodes, and Nova will
|
||||
not pick them for deployment. Power commands will still work on them, though.
|
||||
|
||||
After a node is in the maintenance mode, you can attempt repairing it, e.g. by
|
||||
`Fixing invalid node information`_. If you manage to make the node operational
|
||||
again, move it out of the maintenance mode::
|
||||
|
||||
openstack baremetal node maintenance unset <NODE UUID>
|
||||
|
||||
If repairing is not possible, you can force deletion of such a node::
|
||||
|
||||
openstack baremetal node delete <NODE UUID>
|
||||
|
||||
Forcing node removal will leave it powered on, accessing the network with
|
||||
the old IP address(es) and with all services running. Before proceeding, make
|
||||
sure to power it off and clean up via any means.
|
||||
|
||||
After that, the associated Nova instance is orphaned, and must be deleted.
|
||||
You can do it normally via the scale down procedure.
|
||||
|
||||
.. _introspection_problems:
|
||||
|
||||
Hardware Introspection Problems
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
-------------------------------
|
||||
|
||||
Introspection hangs and times out
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
ironic-inspector times out the introspection process after some time (defaulting to
1 hour) if it never gets a response from the introspection ramdisk. This can be
|
||||
a sign of a bug in the introspection ramdisk, but usually it happens due to
|
||||
environment misconfiguration, particularly BIOS boot settings. Please refer to
|
||||
`ironic-inspector troubleshooting documentation
|
||||
<http://docs.openstack.org/developer/ironic-inspector/troubleshooting.html>`_
|
||||
<https://docs.openstack.org/ironic-inspector/latest/user/troubleshooting.html>`_
|
||||
for information on how to detect and fix such problems.
|
||||
|
||||
Accessing the ramdisk
|
||||
@ -89,7 +155,7 @@ manually. Find the line starting with "kernel" and append rootpwd="HASH" to it.
|
||||
Do not append the real password. Alternatively, you can append
|
||||
sshkey="PUBLIC_SSH_KEY" with your public SSH key.
|
||||
|
||||
.. note::
|
||||
.. warning::
|
||||
In both cases quotation marks are required!
|
||||
|
||||
When the ramdisk is running, figure out its IP address by checking the ``arp`` utility
|
||||
@ -107,32 +173,19 @@ SSH as a root user with the temporary password or the SSH key.
|
||||
image with the selinux-permissive element for diskimage-builder or by
|
||||
passing selinux=0 in the kernel command line.
|
||||
|
||||
Accessing logs from the ramdisk
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Introspection logs are saved on ramdisk failures. Starting with the Newton
|
||||
release, they are actually stored on all introspection failures. The standard
|
||||
location is ``/var/log/ironic-inspector/ramdisk``, and the files there are
|
||||
actually ``tar.gz`` without an extension.
|
||||
|
||||
To collect introspection logs in other cases, set
|
||||
``always_store_ramdisk_logs = true`` in
|
||||
``/etc/ironic-inspector/inspector.conf``, restart the
|
||||
``openstack-ironic-inspector`` service and retry the introspection.
|
||||
|
||||
Refusing to introspect node with provision state "available"
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If you're running introspection directly using the ironic-inspector CLI (or in case
of bugs in our scripts), a node can be in the "AVAILABLE" state, which is meant
for deployment, not for introspection. You should advance the node to the
"MANAGEABLE" state before introspection and move it back before deployment.
|
||||
Please refer to `upstream node states documentation
|
||||
<http://docs.openstack.org/developer/ironic-inspector/usage.html#node-states>`_
|
||||
<https://docs.openstack.org/ironic-inspector/latest/user/usage.html#node-states>`_
|
||||
for information on how to fix it.
|
||||
|
||||
How can introspection be stopped?
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Introspection for a node can be stopped with the following command::
|
||||
|
||||
|
@ -31,8 +31,8 @@ Next, there are a few layers on which the deployment can fail:
|
||||
* Post-deploy configuration (Puppet)
|
||||
|
||||
As the Ironic service is in the middle layer, you can use its shell to guess the
|
||||
failed layer. Issue ``ironic node-list`` command to see all registered nodes
|
||||
and their current status, you will see something like::
|
||||
failed layer. Issue the ``openstack baremetal node list`` command to see all
registered nodes and their current status; you will see something like::
|
||||
|
||||
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
|
||||
| UUID | Name | Instance UUID | Power State | Provision State | Maintenance |
|
||||
@ -45,13 +45,21 @@ Pay close attention to **Provision State** and **Maintenance** columns
|
||||
in the resulting table.
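On a large deployment, scanning the table by eye is tedious. As a rough sketch, the failed nodes can be filtered out with ``awk``. The function name ``failed_nodes`` is hypothetical, and the column names ``UUID`` and ``Provision State`` are assumed to match the table headers shown above:

```shell
# Hypothetical helper: print the UUIDs of nodes whose provision state is
# "error" or "deploy failed". Assumes the list command accepts the column
# headers "UUID" and "Provision State" and prints one node per line.
failed_nodes() {
    openstack baremetal node list -f value -c UUID -c "Provision State" \
        | awk '/ (deploy failed|error)$/ {print $1}'
}
```

Each UUID it prints can then be passed to ``openstack baremetal node show <UUID> -f value -c last_error``.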
|
||||
|
||||
* If the command shows an empty table or fewer nodes than you expect, or
|
||||
**Maintenance** is ``True``, or **Provision State** is ``manageable``,
|
||||
there was a problem during node enrolling and introspection.
|
||||
Please go back to these steps.
|
||||
**Maintenance** is ``True``, or **Provision State** is ``manageable``
|
||||
or ``enroll``, there was a problem during node enrolling and introspection.
|
||||
|
||||
You can check the actual cause using the following command::
|
||||
|
||||
openstack baremetal node show <UUID> -f value -c maintenance_reason
|
||||
|
||||
For example, **Maintenance** is set to ``True`` automatically if wrong power
|
||||
credentials are provided.
|
||||
|
||||
Fix the cause of the failure, then move the node out of the maintenance
|
||||
mode::
|
||||
|
||||
openstack baremetal node maintenance unset <NODE UUID>
|
||||
|
||||
* If **Provision State** is ``available`` then the problem occurred before
|
||||
bare metal deployment has even started. Proceed with `Debugging Using Heat`_.
|
||||
|
||||
@ -65,16 +73,12 @@ in the resulting table.
|
||||
changes.
|
||||
|
||||
* If **Provision State** is ``error`` or ``deploy failed``, then bare metal
|
||||
deployment has failed for this node. Issue
|
||||
::
|
||||
deployment has failed for this node. Look at the **last_error** field::
|
||||
|
||||
ironic node-show <UUID>
|
||||
openstack baremetal node show <UUID> -f value -c last_error
|
||||
|
||||
and look for **last_error** field. It will contain error description.
|
||||
|
||||
If the error message is vague, you can use logs to clarify it::
|
||||
|
||||
sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
|
||||
If the error message is vague, you can use logs to clarify it, see
|
||||
:ref:`ironic_logs` for details.
|
||||
|
||||
If you see a wait timeout error and the node **Power State** is ``power on``,
|
||||
then try to connect to the virtual console of the failed machine. Use
|
||||
@ -248,8 +252,13 @@ Start with checking `Ironic troubleshooting guide on this topic
|
||||
|
||||
If you're using advanced profile matching with multiple flavors, make sure
|
||||
you have enough nodes corresponding to each flavor/profile. Check the
|
||||
``capabilities`` key in ``properties`` field for ``ironic node-show``.
|
||||
It should contain e.g. ``profile:compute``.
|
||||
``capabilities`` key in the output of
|
||||
|
||||
::
|
||||
|
||||
openstack baremetal node show <UUID> --fields properties
|
||||
|
||||
It should contain e.g. ``profile:compute`` for compute nodes.
|
||||
|
||||
|
||||
Debugging OpenStack services
|
||||
|