Actualize ironic-related troubleshooting bits
* New content on removing broken nodes via maintenance mode
* Update the logs locations, add information on deployment logs
* Fix headings in the nodes troubleshooting guide
* Use modern (and more convenient in many cases) OSC commands

Change-Id: I0a0c83cde14d25cf76b4b68b1c1696bbd73e69db
parent
2388c54a03
commit
ea7ec0312e
@ -1,27 +1,56 @@
|
|||||||
Troubleshooting Node Management Failures
|
Troubleshooting Node Management Failures
|
||||||
----------------------------------------
|
========================================
|
||||||
|
|
||||||
Where Are the Logs?
|
Where Are the Logs?
|
||||||
^^^^^^^^^^^^^^^^^^^
|
-------------------
|
||||||
|
|
||||||
Some logs are stored in *journald*, but most are stored as text files in
|
Some logs are stored in *journald*, but most are stored as text files in
|
||||||
``/var/log``. Ironic and ironic-inspector logs are stored in journald. Note
|
``/var/log``. They are only accessible by the root user.
|
||||||
that Ironic has 2 units: ``openstack-ironic-api`` and
|
|
||||||
``openstack-ironic-conductor``. Similarly, ironic-inspector has
|
|
||||||
``openstack-ironic-inspector`` and ``openstack-ironic-inspector-dnsmasq``. So
|
|
||||||
for example to get all ironic-inspector logs use::
|
|
||||||
|
|
||||||
sudo journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq
|
ironic-inspector
|
||||||
|
~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
If something fails during the introspection ramdisk run, ironic-inspector
|
The introspection logs (from ironic-inspector) are located in
|
||||||
stores the ramdisk logs in ``/var/log/ironic-inspector/ramdisk/`` as
|
``/var/log/ironic-inspector``. If something fails during the introspection
|
||||||
gz-compressed tar files. File names contain date, time and IPMI address of the
|
ramdisk run, ironic-inspector stores the ramdisk logs in
|
||||||
node if it was detected (only for bare metal).
|
``/var/log/ironic-inspector/ramdisk/`` as gz-compressed tar files.
|
||||||
|
File names contain date, time and IPMI address of the node if it was detected
|
||||||
|
(only for bare metal).
|
||||||
|
|
||||||
|
To collect introspection logs on success as well, set
|
||||||
|
``always_store_ramdisk_logs = true`` in
|
||||||
|
``/etc/ironic-inspector/inspector.conf``, restart the
|
||||||
|
``openstack-ironic-inspector`` service and retry the introspection.
|
||||||
|
|
||||||
|
.. _ironic_logs:
|
||||||
|
|
||||||
|
ironic
|
||||||
|
~~~~~~
|
||||||
|
|
||||||
|
The deployment logs (from ironic) are located in ``/var/log/ironic``. If
|
||||||
|
something goes wrong during deployment or cleaning, the ramdisk logs are
|
||||||
|
stored in ``/var/log/ironic/deploy``. See `ironic logs retrieving documentation
|
||||||
|
<https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#retrieving-logs-from-the-deploy-ramdisk>`_
|
||||||
|
for more details.
|
||||||
|
|
||||||
.. _node_registration_problems:
|
.. _node_registration_problems:
|
||||||
|
|
||||||
Node Registration Problems
|
Node Registration and Management Problems
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
-----------------------------------------
|
||||||
|
|
||||||
|
Nodes in enroll state after registration
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
If you see your nodes staying in the ``enroll`` provision state after the
|
||||||
|
registration process (which may hang due to this), it means that Ironic is
|
||||||
|
unable to verify power management credentials, and you need to fix them.
|
||||||
|
Check the ``pm_addr``, ``pm_user`` and ``pm_password`` fields in your
|
||||||
|
``instackenv.json``. In some cases (e.g. when using
|
||||||
|
:doc:`../environments/virtualbmc`) you also need a correct ``pm_port``.
|
||||||
|
Update the node as explained in `Fixing invalid node information`_.
|
||||||
|
|
||||||
|
Fixing invalid node information
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Any problems with node data registered into Ironic can be fixed using the
|
Any problems with node data registered into Ironic can be fixed using the
|
||||||
Ironic CLI.
|
Ironic CLI.
|
||||||
@ -31,16 +60,16 @@ For example, a wrong MAC can be fixed in two steps:
|
|||||||
* Find out the assigned port UUID by running
|
* Find out the assigned port UUID by running
|
||||||
::
|
::
|
||||||
|
|
||||||
ironic node-port-list <NODE UUID>
|
openstack baremetal port list --node <NODE UUID>
|
||||||
|
|
||||||
* Update the MAC address by running
|
* Update the MAC address by running
|
||||||
::
|
::
|
||||||
|
|
||||||
ironic port-update <PORT UUID> replace address=<NEW MAC>
|
openstack baremetal port set --address=<NEW MAC> <PORT UUID>
|
||||||
|
|
||||||
A wrong IPMI address can be fixed with the following command::
|
A wrong IPMI address can be fixed with the following command::
|
||||||
|
|
||||||
ironic node-update <NODE UUID> replace driver_info/ipmi_address=<NEW IPMI ADDRESS>
|
openstack baremetal node set <NODE UUID> --driver-info ipmi_address=<NEW IPMI ADDRESS>
|
||||||
|
|
||||||
Node power state is not enforced by Ironic
|
Node power state is not enforced by Ironic
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
@ -58,21 +87,58 @@ Also, note that if ``openstack undercloud install`` is re-run the value of
|
|||||||
the ``force_power_state_during_sync`` configuration option will be set back to
|
the ``force_power_state_during_sync`` configuration option will be set back to
|
||||||
the default, which is ``False``.
|
the default, which is ``False``.
|
||||||
|
|
||||||
|
How do I repair broken nodes
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Usually, the nodes should only be deleted when the hardware is decommissioned.
|
||||||
|
Before that, you're expected to remove instances from them using scale-down.
|
||||||
|
However, in some cases, it may be impossible to repair a node with e.g. broken
|
||||||
|
power management, and it gets stuck in an abnormal state.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
Before proceeding with this section, always try to decommission a node
|
||||||
|
normally, by scaling down your cloud. Forcing node deletion may cause
|
||||||
|
unpredictable results.
|
||||||
|
|
||||||
|
Ironic requires that nodes that cannot be operated normally are put in the
|
||||||
|
maintenance mode. It is done by the following command::
|
||||||
|
|
||||||
|
openstack baremetal node maintenance set <NODE UUID> --reason="<EXPLANATION>"
|
||||||
|
|
||||||
|
Ironic will stop checking power and health state for such nodes, and Nova will
|
||||||
|
not pick them for deployment. Power commands will still work on them, though.
|
||||||
|
|
||||||
|
After a node is in the maintenance mode, you can attempt repairing it, e.g. by
|
||||||
|
`Fixing invalid node information`_. If you manage to make the node operational
|
||||||
|
again, move it out of the maintenance mode::
|
||||||
|
|
||||||
|
openstack baremetal node maintenance unset <NODE UUID>
|
||||||
|
|
||||||
|
If repairing is not possible, you can force deletion of such a node::
|
||||||
|
|
||||||
|
openstack baremetal node delete <NODE UUID>
|
||||||
|
|
||||||
|
Forcing node removal will leave it powered on, accessing the network with
|
||||||
|
the old IP address(es) and with all services running. Before proceeding, make
|
||||||
|
sure to power it off and clean up via any means.
|
||||||
|
|
||||||
|
After that, the associated Nova instance is orphaned, and must be deleted.
|
||||||
|
You can do it normally via the scale down procedure.
|
||||||
|
|
||||||
.. _introspection_problems:
|
.. _introspection_problems:
|
||||||
|
|
||||||
Hardware Introspection Problems
|
Hardware Introspection Problems
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
-------------------------------
|
||||||
|
|
||||||
Introspection hangs and times out
|
Introspection hangs and times out
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
ironic-inspector times out introspection process after some time (defaulting to
|
ironic-inspector times out introspection process after some time (defaulting to
|
||||||
1 hour) if it never gets response from the introspection ramdisk. This can be
|
1 hour) if it never gets response from the introspection ramdisk. This can be
|
||||||
a sign of a bug in the introspection ramdisk, but usually it happens due to
|
a sign of a bug in the introspection ramdisk, but usually it happens due to
|
||||||
environment misconfiguration, particularly BIOS boot settings. Please refer to
|
environment misconfiguration, particularly BIOS boot settings. Please refer to
|
||||||
`ironic-inspector troubleshooting documentation
|
`ironic-inspector troubleshooting documentation
|
||||||
<http://docs.openstack.org/developer/ironic-inspector/troubleshooting.html>`_
|
<https://docs.openstack.org/ironic-inspector/latest/user/troubleshooting.html>`_
|
||||||
for information on how to detect and fix such problems.
|
for information on how to detect and fix such problems.
|
||||||
|
|
||||||
Accessing the ramdisk
|
Accessing the ramdisk
|
||||||
@ -89,7 +155,7 @@ manually. Find the line starting with "kernel" and append rootpwd="HASH" to it.
|
|||||||
Do not append the real password. Alternatively, you can append
|
Do not append the real password. Alternatively, you can append
|
||||||
sshkey="PUBLIC_SSH_KEY" with your public SSH key.
|
sshkey="PUBLIC_SSH_KEY" with your public SSH key.
|
||||||
|
|
||||||
.. note::
|
.. warning::
|
||||||
In both cases quotation marks are required!
|
In both cases quotation marks are required!
|
||||||
|
|
||||||
When ramdisk is running, figure out its IP address by checking ``arp`` utility
|
When ramdisk is running, figure out its IP address by checking ``arp`` utility
|
||||||
@ -107,32 +173,19 @@ SSH as a root user with the temporary password or the SSH key.
|
|||||||
image with the selinux-permissive element for diskimage-builder or by
|
image with the selinux-permissive element for diskimage-builder or by
|
||||||
passing selinux=0 in the kernel command line.
|
passing selinux=0 in the kernel command line.
|
||||||
|
|
||||||
Accessing logs from the ramdisk
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Introspection logs are saved on ramdisk failures. Starting with the Newton
|
|
||||||
release, they are actually stored on all introspection failures. The standard
|
|
||||||
location is ``/var/log/ironic-inspector/ramdisk``, and the files there are
|
|
||||||
actually ``tar.gz`` without an extension.
|
|
||||||
|
|
||||||
To collect introspection logs in other cases, set
|
|
||||||
``always_store_ramdisk_logs = true`` in
|
|
||||||
``/etc/ironic-inspector/inspector.conf``, restart the
|
|
||||||
``openstack-ironic-inspector`` service and retry the introspection.
|
|
||||||
|
|
||||||
Refusing to introspect node with provision state "available"
|
Refusing to introspect node with provision state "available"
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
If you're running introspection directly using ironic-inspector CLI (or in case
|
If you're running introspection directly using ironic-inspector CLI (or in case
|
||||||
of bugs in our scripts), a node can be in the "AVAILABLE" state, which is meant
|
of bugs in our scripts), a node can be in the "AVAILABLE" state, which is meant
|
||||||
for deployment, not for introspection. You should advance node to the
|
for deployment, not for introspection. You should advance node to the
|
||||||
"MANAGEABLE" state before introspection and move it back before deployment.
|
"MANAGEABLE" state before introspection and move it back before deployment.
|
||||||
Please refer to `upstream node states documentation
|
Please refer to `upstream node states documentation
|
||||||
<http://docs.openstack.org/developer/ironic-inspector/usage.html#node-states>`_
|
<https://docs.openstack.org/ironic-inspector/latest/user/usage.html#node-states>`_
|
||||||
for information on how to fix it.
|
for information on how to fix it.
|
||||||
|
|
||||||
How can introspection be stopped?
|
How can introspection be stopped?
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Introspection for a node can be stopped with the following command::
|
Introspection for a node can be stopped with the following command::
|
||||||
|
|
||||||
|
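The updated guide says that ironic-inspector stores ramdisk logs in ``/var/log/ironic-inspector/ramdisk/`` as gz-compressed tar files. As a rough illustration, the sketch below lists those archives and their members with Python's standard ``tarfile`` module. The helper names ``list_ramdisk_archives`` and ``archive_members`` are hypothetical, not part of ironic-inspector, and the sketch assumes each regular file in the directory is one archive.

```python
import tarfile
from pathlib import Path

# Directory named in the guide; the archive names carry no extension.
RAMDISK_LOG_DIR = Path("/var/log/ironic-inspector/ramdisk")


def list_ramdisk_archives(log_dir=RAMDISK_LOG_DIR):
    """Return the regular files in log_dir, oldest first (by mtime).

    Hypothetical helper: returns an empty list if the directory is absent.
    """
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    files = [p for p in log_dir.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def archive_members(archive):
    """Return the member names of one gz-compressed tar archive."""
    with tarfile.open(archive, mode="r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    # The guide notes that logs under /var/log are readable by root only,
    # so this would normally be run with sudo.
    for path in list_ramdisk_archives():
        print(path.name)
        for name in archive_members(path):
            print("   ", name)
```

Opening with ``mode="r:gz"`` states the expected format explicitly, so a file that is not a gzip-compressed tar fails with ``tarfile.ReadError`` instead of being misread.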
@ -31,8 +31,8 @@ Next, there are a few layers on which the deployment can fail:
|
|||||||
* Post-deploy configuration (Puppet)
|
* Post-deploy configuration (Puppet)
|
||||||
|
|
||||||
As Ironic service is in the middle layer, you can use its shell to guess the
|
As Ironic service is in the middle layer, you can use its shell to guess the
|
||||||
failed layer. Issue ``ironic node-list`` command to see all registered nodes
|
failed layer. Issue ``openstack baremetal node list`` command to see all
|
||||||
and their current status, you will see something like::
|
registered nodes and their current status, you will see something like::
|
||||||
|
|
||||||
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
|
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
|
||||||
| UUID | Name | Instance UUID | Power State | Provision State | Maintenance |
|
| UUID | Name | Instance UUID | Power State | Provision State | Maintenance |
|
||||||
@ -45,13 +45,21 @@ Pay close attention to **Provision State** and **Maintenance** columns
|
|||||||
in the resulting table.
|
in the resulting table.
|
||||||
|
|
||||||
* If the command shows an empty table or fewer nodes than you expect, or
|
* If the command shows an empty table or fewer nodes than you expect, or
|
||||||
**Maintenance** is ``True``, or **Provision State** is ``manageable``,
|
**Maintenance** is ``True``, or **Provision State** is ``manageable``
|
||||||
there was a problem during node enrolling and introspection.
|
or ``enroll``, there was a problem during node enrolling and introspection.
|
||||||
Please go back to these steps.
|
|
||||||
|
You can check the actual cause using the following command::
|
||||||
|
|
||||||
|
openstack baremetal node show <UUID> -f value -c maintenance_reason
|
||||||
|
|
||||||
For example, **Maintenance** goes to ``True`` automatically, if wrong power
|
For example, **Maintenance** goes to ``True`` automatically, if wrong power
|
||||||
credentials are provided.
|
credentials are provided.
|
||||||
|
|
||||||
|
Fix the cause of the failure, then move the node out of the maintenance
|
||||||
|
mode::
|
||||||
|
|
||||||
|
openstack baremetal node maintenance unset <NODE UUID>
|
||||||
|
|
||||||
* If **Provision State** is ``available`` then the problem occurred before
|
* If **Provision State** is ``available`` then the problem occurred before
|
||||||
bare metal deployment has even started. Proceed with `Debugging Using Heat`_.
|
bare metal deployment has even started. Proceed with `Debugging Using Heat`_.
|
||||||
|
|
||||||
@ -65,16 +73,12 @@ in the resulting table.
|
|||||||
changes.
|
changes.
|
||||||
|
|
||||||
* If **Provision State** is ``error`` or ``deploy failed``, then bare metal
|
* If **Provision State** is ``error`` or ``deploy failed``, then bare metal
|
||||||
deployment has failed for this node. Issue
|
deployment has failed for this node. Look at the **last_error** field::
|
||||||
::
|
|
||||||
|
|
||||||
ironic node-show <UUID>
|
openstack baremetal node show <UUID> -f value -c last_error
|
||||||
|
|
||||||
and look for **last_error** field. It will contain error description.
|
If the error message is vague, you can use logs to clarify it, see
|
||||||
|
:ref:`ironic_logs` for details.
|
||||||
If the error message is vague, you can use logs to clarify it::
|
|
||||||
|
|
||||||
sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
|
|
||||||
|
|
||||||
If you see wait timeout error, and node **Power State** is ``power on``,
|
If you see wait timeout error, and node **Power State** is ``power on``,
|
||||||
then try to connect to the virtual console of the failed machine. Use
|
then try to connect to the virtual console of the failed machine. Use
|
||||||
@ -248,8 +252,13 @@ Start with checking `Ironic troubleshooting guide on this topic
|
|||||||
|
|
||||||
If you're using advanced profile matching with multiple flavors, make sure
|
If you're using advanced profile matching with multiple flavors, make sure
|
||||||
you have enough nodes corresponding to each flavor/profile. Watch
|
you have enough nodes corresponding to each flavor/profile. Watch
|
||||||
``capabilities`` key in ``properties`` field for ``ironic node-show``.
|
``capabilities`` key in the output of
|
||||||
It should contain e.g. ``profile:compute``.
|
|
||||||
|
::
|
||||||
|
|
||||||
|
openstack baremetal node show <UUID> --fields properties
|
||||||
|
|
||||||
|
It should contain e.g. ``profile:compute`` for compute nodes.
|
||||||
|
|
||||||
|
|
||||||
Debugging OpenStack services
|
Debugging OpenStack services
|
||||||
|
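The triage rules this change documents for the output of ``openstack baremetal node list`` can be summarized as a small lookup. The sketch below covers only the cases visible in the hunks above. The function name ``next_step`` and its return strings are hypothetical, and states the diff does not show are deliberately left unmapped.

```python
def next_step(provision_state, maintenance=False):
    """Map one node-list row to the follow-up suggested by the guide.

    Hypothetical helper. Returns None for any state the visible part of
    the guide does not cover, so the caller falls back to the full text.
    """
    state = provision_state.strip().lower()
    # Maintenance is checked first, mirroring the order of the guide's list.
    if maintenance or state in ("manageable", "enroll"):
        return "enrollment or introspection problem: check maintenance_reason"
    if state == "available":
        return "failure before bare metal deployment: debug using Heat"
    if state in ("error", "deploy failed"):
        return "bare metal deployment failed: check last_error"
    return None


if __name__ == "__main__":
    # Hypothetical rows: (provision state, maintenance flag).
    for row in [("enroll", False), ("available", False), ("deploy failed", False)]:
        print(row, "->", next_step(*row))
```

Checking the maintenance flag before the provision state is an assumption about precedence; the guide lists the conditions together and does not say which one wins when both apply.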