Actualize ironic-related troubleshooting bits

* New content on removing broken nodes via maintenance mode
* Update the logs locations, add information on deployment logs
* Fix headings in the nodes troubleshooting guide
* Use modern (and more convenient in many cases) OSC commands

Change-Id: I0a0c83cde14d25cf76b4b68b1c1696bbd73e69db
Dmitry Tantsur 2017-08-22 18:09:18 +02:00
parent 2388c54a03
commit ea7ec0312e
2 changed files with 114 additions and 52 deletions


@ -1,27 +1,56 @@
Troubleshooting Node Management Failures Troubleshooting Node Management Failures
---------------------------------------- ========================================
Where Are the Logs? Where Are the Logs?
^^^^^^^^^^^^^^^^^^^ -------------------
Some logs are stored in *journald*, but most are stored as text files in Some logs are stored in *journald*, but most are stored as text files in
``/var/log``. Ironic and ironic-inspector logs are stored in journald. Note ``/var/log``. They are only accessible by the root user.
that Ironic has 2 units: ``openstack-ironic-api`` and
``openstack-ironic-conductor``. Similarly, ironic-inspector has
``openstack-ironic-inspector`` and ``openstack-ironic-inspector-dnsmasq``. So
for example to get all ironic-inspector logs use::
sudo journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq ironic-inspector
~~~~~~~~~~~~~~~~
If something fails during the introspection ramdisk run, ironic-inspector The introspection logs (from ironic-inspector) are located in
stores the ramdisk logs in ``/var/log/ironic-inspector/ramdisk/`` as ``/var/log/ironic-inspector``. If something fails during the introspection
gz-compressed tar files. File names contain date, time and IPMI address of the ramdisk run, ironic-inspector stores the ramdisk logs in
node if it was detected (only for bare metal). ``/var/log/ironic-inspector/ramdisk/`` as gz-compressed tar files.
File names contain date, time and IPMI address of the node if it was detected
(only for bare metal).
To collect introspection logs on success as well, set
``always_store_ramdisk_logs = true`` in
``/etc/ironic-inspector/inspector.conf``, restart the
``openstack-ironic-inspector`` service and retry the introspection.
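Since the stored ramdisk logs are gz-compressed tar files, their contents can be listed without unpacking them. The following is a minimal sketch, not part of the documented tooling: ``list_ramdisk_logs`` is a hypothetical helper name, and the default directory is the one mentioned above. Because the logs are only accessible by the root user, it would need to run from a root shell.

```shell
# Hypothetical helper: print the file list of every stored ramdisk log archive.
# The directory argument is optional and defaults to the location named above.
list_ramdisk_logs() {
    log_dir="${1:-/var/log/ironic-inspector/ramdisk}"
    for archive in "$log_dir"/*; do
        # Skip sub-directories and the unexpanded glob of an empty directory.
        [ -f "$archive" ] || continue
        echo "== $archive"
        # -t lists members, -z handles gzip; the files carry no extension.
        tar -tzf "$archive"
    done
}
```

Calling ``list_ramdisk_logs`` with no argument inspects the default directory; passing another path makes it usable on a copy of the logs.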
.. _ironic_logs:
ironic
~~~~~~
The deployment logs (from ironic) are located in ``/var/log/ironic``. If
something goes wrong during deployment or cleaning, the ramdisk logs are
stored in ``/var/log/ironic/deploy``. See `ironic logs retrieving documentation
<https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#retrieving-logs-from-the-deploy-ramdisk>`_
for more details.
.. _node_registration_problems: .. _node_registration_problems:
Node Registration Problems Node Registration and Management Problems
^^^^^^^^^^^^^^^^^^^^^^^^^^ -----------------------------------------
Nodes in enroll state after registration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you see your nodes staying in the ``enroll`` provision state after the
registration process (which may hang due to this), it means that Ironic is
unable to verify power management credentials, and you need to fix them.
Check the ``pm_addr``, ``pm_user`` and ``pm_password`` fields in your
``instackenv.json``. In some cases (e.g. when using
:doc:`../environments/virtualbmc`) you also need a correct ``pm_port``.
Update the node as explained in `Fixing invalid node information`_.
Fixing invalid node information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Any problems with node data registered into Ironic can be fixed using the Any problems with node data registered into Ironic can be fixed using the
Ironic CLI. Ironic CLI.
@ -31,16 +60,16 @@ For example, a wrong MAC can be fixed in two steps:
* Find out the assigned port UUID by running * Find out the assigned port UUID by running
:: ::
ironic node-port-list <NODE UUID> openstack baremetal port list --node <NODE UUID>
* Update the MAC address by running * Update the MAC address by running
:: ::
ironic port-update <PORT UUID> replace address=<NEW MAC> openstack baremetal port set --address=<NEW MAC> <PORT UUID>
A Wrong IPMI address can be fixed with the following command:: A Wrong IPMI address can be fixed with the following command::
ironic node-update <NODE UUID> replace driver_info/ipmi_address=<NEW IPMI ADDRESS> openstack baremetal node set <NODE UUID> --driver-info ipmi_address=<NEW IPMI ADDRESS>
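The two MAC-fixing steps above can be chained in a small script. This is only a sketch under the assumption that the node has exactly one port registered; ``fix_node_mac`` is a hypothetical helper name, and ``-f value -c UUID`` is the generic output formatting of the ``openstack`` client used to get the bare port UUID.

```shell
# Hypothetical helper: replace the MAC address of a node's only port.
# Usage: fix_node_mac <NODE UUID> <NEW MAC>
fix_node_mac() {
    node_uuid="$1"
    new_mac="$2"
    # Ask for the UUID column only, without table decoration.
    port_uuids=$(openstack baremetal port list --node "$node_uuid" -f value -c UUID)
    # Split the result into positional parameters to count the ports.
    set -- $port_uuids
    if [ "$#" -ne 1 ]; then
        echo "Expected exactly one port for node $node_uuid, found $#" >&2
        return 1
    fi
    openstack baremetal port set --address "$new_mac" "$1"
}
```

Nodes with several ports are rejected on purpose, because the script cannot know which port carries the wrong address.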
Node power state is not enforced by Ironic Node power state is not enforced by Ironic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -58,21 +87,58 @@ Also, note that if ``openstack undercloud install`` is re-run the value of
the ``force_power_state_during_sync`` configuration option will be set back to the ``force_power_state_during_sync`` configuration option will be set back to
the default, which is ``False``. the default, which is ``False``.
How do I repair broken nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usually, the nodes should only be deleted when the hardware is decommissioned.
Before that, you're expected to remove instances from them using scale-down.
However, in some cases, it may be impossible to repair a node with e.g. broken
power management, and it gets stuck in an abnormal state.
.. warning::
Before proceeding with this section, always try to decommission a node
normally, by scaling down your cloud. Forcing node deletion may cause
unpredictable results.
Ironic requires that nodes that cannot be operated normally are put into
maintenance mode. This is done with the following command::
openstack baremetal node maintenance set <NODE UUID> --reason="<EXPLANATION>"
Ironic will stop checking power and health state for such nodes, and Nova will
not pick them for deployment. Power commands will still work on them, though.
After a node is in the maintenance mode, you can attempt repairing it, e.g. by
`Fixing invalid node information`_. If you manage to make the node operational
again, move it out of the maintenance mode::
openstack baremetal node maintenance unset <NODE UUID>
If repairing is not possible, you can force deletion of such a node::
openstack baremetal node delete <NODE UUID>
Forcing node removal will leave it powered on, accessing the network with
the old IP address(es) and with all services running. Before proceeding, make
sure to power it off and clean up via any means.
After that, the associated Nova instance is orphaned, and must be deleted.
You can do it normally via the scale down procedure.
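The first step of the procedure above, putting a node into maintenance and confirming the result, can be sketched as a small shell function. ``quarantine_node`` is a hypothetical name used only for illustration; the commands and the ``maintenance_reason`` field are the ones already used in this guide.

```shell
# Hypothetical helper: put a node into maintenance mode and report its state.
# Usage: quarantine_node <NODE UUID> "<EXPLANATION>"
quarantine_node() {
    node_uuid="$1"
    reason="$2"
    openstack baremetal node maintenance set "$node_uuid" --reason "$reason" || return 1
    # Print the maintenance flag and the recorded reason to confirm the change.
    openstack baremetal node show "$node_uuid" -f value -c maintenance -c maintenance_reason
}
```

Repairing, unsetting maintenance or deleting the node are left as manual steps, since they depend on what is actually wrong with the hardware.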
.. _introspection_problems: .. _introspection_problems:
Hardware Introspection Problems Hardware Introspection Problems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -------------------------------
Introspection hangs and times out Introspection hangs and times out
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ironic-inspector times out introspection process after some time (defaulting to ironic-inspector times out introspection process after some time (defaulting to
1 hour) if it never gets response from the introspection ramdisk. This can be 1 hour) if it never gets response from the introspection ramdisk. This can be
a sign of a bug in the introspection ramdisk, but usually it happens due to a sign of a bug in the introspection ramdisk, but usually it happens due to
environment misconfiguration, particularly BIOS boot settings. Please refer to environment misconfiguration, particularly BIOS boot settings. Please refer to
`ironic-inspector troubleshooting documentation `ironic-inspector troubleshooting documentation
<http://docs.openstack.org/developer/ironic-inspector/troubleshooting.html>`_ <https://docs.openstack.org/ironic-inspector/latest/user/troubleshooting.html>`_
for information on how to detect and fix such problems. for information on how to detect and fix such problems.
Accessing the ramdisk Accessing the ramdisk
@ -89,7 +155,7 @@ manually. Find the line starting with "kernel" and append rootpwd="HASH" to it.
Do not append the real password. Alternatively, you can append Do not append the real password. Alternatively, you can append
sshkey="PUBLIC_SSH_KEY" with your public SSH key. sshkey="PUBLIC_SSH_KEY" with your public SSH key.
.. note:: .. warning::
In both cases quotation marks are required! In both cases quotation marks are required!
When ramdisk is running, figure out its IP address by checking ``arp`` utility When ramdisk is running, figure out its IP address by checking ``arp`` utility
@ -107,32 +173,19 @@ SSH as a root user with the temporary password or the SSH key.
image with the selinux-permissive element for diskimage-builder or by image with the selinux-permissive element for diskimage-builder or by
passing selinux=0 in the kernel command line. passing selinux=0 in the kernel command line.
Accessing logs from the ramdisk
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introspection logs are saved on ramdisk failures. Starting with the Newton
release, they are actually stored on all introspection failures. The standard
location is ``/var/log/ironic-inspector/ramdisk``, and the files there are
actually ``tar.gz`` without an extension.
To collect introspection logs in other cases, set
``always_store_ramdisk_logs = true`` in
``/etc/ironic-inspector/inspector.conf``, restart the
``openstack-ironic-inspector`` service and retry the introspection.
Refusing to introspect node with provision state "available" Refusing to introspect node with provision state "available"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you're running introspection directly using ironic-inspector CLI (or in case If you're running introspection directly using ironic-inspector CLI (or in case
of bugs in our scripts), a node can be in the "AVAILABLE" state, which is meant of bugs in our scripts), a node can be in the "AVAILABLE" state, which is meant
for deployment, not for introspection. You should advance node to the for deployment, not for introspection. You should advance node to the
"MANAGEABLE" state before introspection and move it back before deployment. "MANAGEABLE" state before introspection and move it back before deployment.
Please refer to `upstream node states documentation Please refer to `upstream node states documentation
<http://docs.openstack.org/developer/ironic-inspector/usage.html#node-states>`_ <https://docs.openstack.org/ironic-inspector/latest/user/usage.html#node-states>`_
for information on how to fix it. for information on how to fix it.
How can introspection be stopped? How can introspection be stopped?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Introspection for a node can be stopped with the following command:: Introspection for a node can be stopped with the following command::


@ -31,8 +31,8 @@ Next, there are a few layers on which the deployment can fail:
* Post-deploy configuration (Puppet) * Post-deploy configuration (Puppet)
As Ironic service is in the middle layer, you can use its shell to guess the As Ironic service is in the middle layer, you can use its shell to guess the
failed layer. Issue ``ironic node-list`` command to see all registered nodes failed layer. Issue ``openstack baremetal node list`` command to see all
and their current status, you will see something like:: registered nodes and their current status, you will see something like::
+--------------------------------------+------+---------------+-------------+-----------------+-------------+ +--------------------------------------+------+---------------+-------------+-----------------+-------------+
| UUID | Name | Instance UUID | Power State | Provision State | Maintenance | | UUID | Name | Instance UUID | Power State | Provision State | Maintenance |
@ -45,13 +45,21 @@ Pay close attention to **Provision State** and **Maintenance** columns
in the resulting table. in the resulting table.
* If the command shows an empty table or fewer nodes than you expect, or * If the command shows an empty table or fewer nodes than you expect, or
**Maintenance** is ``True``, or **Provision State** is ``manageable``, **Maintenance** is ``True``, or **Provision State** is ``manageable``
there was a problem during node enrolling and introspection. or ``enroll``, there was a problem during node enrolling and introspection.
Please go back to these steps.
You can check the actual cause using the following command::
openstack baremetal node show <UUID> -f value -c maintenance_reason
For example, **Maintenance** goes to ``True`` automatically, if wrong power For example, **Maintenance** goes to ``True`` automatically, if wrong power
credentials are provided. credentials are provided.
Fix the cause of the failure, then move the node out of the maintenance
mode::
openstack baremetal node maintenance unset <NODE UUID>
* If **Provision State** is ``available`` then the problem occurred before * If **Provision State** is ``available`` then the problem occurred before
bare metal deployment has even started. Proceed with `Debugging Using Heat`_. bare metal deployment has even started. Proceed with `Debugging Using Heat`_.
@ -65,16 +73,12 @@ in the resulting table.
changes. changes.
* If **Provision State** is ``error`` or ``deploy failed``, then bare metal * If **Provision State** is ``error`` or ``deploy failed``, then bare metal
deployment has failed for this node. Issue deployment has failed for this node. Look at the **last_error** field::
::
ironic node-show <UUID> openstack baremetal node show <UUID> -f value -c last_error
and look for **last_error** field. It will contain error description. If the error message is vague, you can use logs to clarify it, see
:ref:`ironic_logs` for details.
If the error message is vague, you can use logs to clarify it::
sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
If you see wait timeout error, and node **Power State** is ``power on``, If you see wait timeout error, and node **Power State** is ``power on``,
then try to connect to the virtual console of the failed machine. Use then try to connect to the virtual console of the failed machine. Use
@ -248,8 +252,13 @@ Start with checking `Ironic troubleshooting guide on this topic
If you're using advanced profile matching with multiple flavors, make sure If you're using advanced profile matching with multiple flavors, make sure
you have enough nodes corresponding to each flavor/profile. Watch you have enough nodes corresponding to each flavor/profile. Watch
``capabilities`` key in ``properties`` field for ``ironic node-show``. ``capabilities`` key in the output of
It should contain e.g. ``profile:compute``.
::
openstack baremetal node show <UUID> --fields properties
It should contain e.g. ``profile:compute`` for compute nodes.
Debugging OpenStack services Debugging OpenStack services