Troubleshooting guide: node locked error
Change-Id: I225203816b030aac840922d817c9952a75cb7dc2
parent 4b4f3f38c5
commit 004e1e8897
@ -469,7 +469,8 @@ the conductor is actively working on something related to the node.

Often, this means there is an internal lock or ``reservation`` set on the node
and the conductor is downloading, uploading, or attempting to perform some
sort of Input/Output operation - see `Why does API return "Node is locked by
host"?`_ for details.

In the case the conductor gets stuck, these operations should time out,
but there are cases in operating systems where operations are blocked until
@ -888,3 +889,87 @@ This can be addressed a few different ways:
of last resort" and you may need to reserve additional memory. You may
also wish to adjust the ``[DEFAULT]minimum_memory_wait_retries`` and
``[DEFAULT]minimum_memory_wait_time`` parameters.

Why does API return "Node is locked by host"?
=============================================

This error usually manifests as HTTP error 409 on the client side:

    Node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 is locked by host 192.168.122.1,
    please retry after the current operation is completed.

It happens because an operation that modifies a node is requested while
another such operation is running. The conflicting operation may be
user-requested (e.g. a provisioning action) or related to internal processes
(e.g. changing power state during :doc:`power-sync`). The reported host name
corresponds to the conductor instance that holds the lock.

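When collecting these errors from client output, it can help to pull out the node and the conductor host automatically. The following is a minimal sketch, assuming the message wording matches the sample above; the pattern and the ``parse_node_locked`` helper are illustrative and not part of Ironic.

```python
import re

# Pattern modelled on the sample error message; the exact wording is an
# assumption and may differ between releases.
NODE_LOCKED_RE = re.compile(
    r"Node (?P<node>\S+) is locked by host (?P<host>[^,\s]+),"
)


def parse_node_locked(message):
    """Return (node, host) from a "locked by host" message, or None."""
    match = NODE_LOCKED_RE.search(message)
    if match is None:
        return None
    return match.group("node"), match.group("host")


message = (
    "Node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 is locked by host "
    "192.168.122.1, please retry after the current operation is completed."
)
print(parse_node_locked(message))
```

The returned host tells you which conductor's logs to inspect.
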
Normally, these errors are transient and safe to retry after a few seconds. If
the lock is held for a significant time, these are the steps you can take.

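Because the error is normally transient, a client can simply retry after a short pause. The sketch below shows one way to do that. ``ConflictError`` is a hypothetical stand-in for whatever exception your client raises on HTTP 409, and the attempt count and delay are arbitrary assumptions.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for a client-side HTTP 409 error."""


def retry_on_conflict(operation, attempts=5, delay=2.0, sleep=time.sleep):
    """Call ``operation`` until it succeeds or ``attempts`` are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConflictError:
            if attempt == attempts:
                # The lock has been held for too long: stop retrying and
                # let the caller investigate.
                raise
            sleep(delay)
```

Any zero-argument callable can be passed as ``operation``. If the final attempt still fails, the error is re-raised, which is the signal to start the investigation described next.
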
First of all, check the current ``provision_state`` of the node:

``verifying``
  means that the conductor is trying to access the node's BMC.
  If it lasts for minutes, the BMC is either unreachable or
  misbehaving. Double-check the information in ``driver_info``, especially
  the BMC address and credentials.

  If the access details seem correct, try resetting the BMC using, for
  example, its web UI.

``deploying``/``inspecting``/``cleaning``
  means that the conductor is doing some active work. It may include
  downloading or converting images, executing synchronous out-of-band deploy
  or clean steps, etc. A node can stay in this state for minutes, depending
  on various factors. Consult the conductor logs.

``available``/``manageable``/``wait call-back``/``clean wait``
  means that some background process is holding the lock. Most commonly it's
  the power synchronization loop. Similarly to the ``verifying`` state,
  it may mean that the BMC access is broken or too slow. The conductor logs
  will provide insight into what is happening.

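The state-based triage above can be condensed into a small lookup, for instance for use in a support script. This is only a sketch: the hint texts paraphrase the list above, and ``LOCK_HINTS`` and ``lock_hint`` are hypothetical names.

```python
# Hints condensed from the states discussed above; other provision states
# exist and fall back to a generic suggestion.
_ACTIVE = "conductor is doing active work; consult the conductor logs"
_BACKGROUND = (
    "a background process (commonly power sync) holds the lock; "
    "BMC access may be broken or too slow"
)

LOCK_HINTS = {
    "verifying": "conductor is accessing the BMC; check driver_info",
    "deploying": _ACTIVE,
    "inspecting": _ACTIVE,
    "cleaning": _ACTIVE,
    "available": _BACKGROUND,
    "manageable": _BACKGROUND,
    "wait call-back": _BACKGROUND,
    "clean wait": _BACKGROUND,
}


def lock_hint(provision_state):
    """Return a short hint on why a node in this state may be locked."""
    return LOCK_HINTS.get(provision_state, "consult the conductor logs")


print(lock_hint("verifying"))
```
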
To trace the process using conductor logs:

#. Isolate the relevant log parts. Lock messages come from the
   ``ironic.conductor.task_manager`` module. You can also check the
   ``ironic.common.states`` module for any state transitions:

   .. code-block:: console

       $ grep -E '(ironic.conductor.task_manager|ironic.common.states|NodeLocked)' \
           conductor.log > state.log

#. Find the first instance of ``NodeLocked``. It may look like this (stripping
   timestamps and request IDs here and below for readability)::

       DEBUG ironic.conductor.task_manager [-] Attempting to get exclusive lock on node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 (for node update) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:233
       DEBUG ironic_lib.json_rpc.server [-] RPC error NodeLocked: Node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 is locked by host 192.168.57.53, please retry after the current operation is completed. _handle_error /usr/lib/python3.6/site-packages/ironic_lib/json_rpc/server.py:179

   The events right before this failure will give you a clue as to why the
   lock is held.

#. Find the last successful **exclusive** locking event before the failure, for
   example::

       DEBUG ironic.conductor.task_manager [-] Attempting to get exclusive lock on node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 (for provision action manage) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:233
       DEBUG ironic.conductor.task_manager [-] Node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 successfully reserved for provision action manage (took 0.01 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:350
       DEBUG ironic.common.states [-] Exiting old state 'enroll' in response to event 'manage' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:307
       DEBUG ironic.common.states [-] Entering new state 'verifying' in response to event 'manage' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:313

   This is your root cause: the lock is held because of the BMC credentials
   verification.

#. Find when the lock is released (if at all). The messages look like this::

       DEBUG ironic.conductor.task_manager [-] Successfully released exclusive lock for provision action manage on node d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 (lock was held 60.02 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:447

   The message tells you the reason the lock was held (``for provision action
   manage``) and the amount of time it was held (60.02 seconds, which is far
   too long for accessing a BMC).

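On a busy conductor, scanning the release messages by hand is tedious. The sketch below picks out locks that were held longer than a threshold. It assumes the release message format matches the sample above; the regular expression, the ``long_held_locks`` helper and the 10-second default threshold are illustrative assumptions, not Ironic settings.

```python
import re

# Pattern modelled on the sample "released exclusive lock" message; the
# wording is an assumption and may differ between releases.
RELEASED_RE = re.compile(
    r"Successfully released exclusive lock for (?P<purpose>.+?) "
    r"on node (?P<node>\S+) \(lock was held (?P<seconds>[\d.]+) sec\)"
)


def long_held_locks(lines, threshold=10.0):
    """Yield (node, purpose, seconds) for locks held above the threshold."""
    for line in lines:
        match = RELEASED_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group("seconds"))
        if seconds > threshold:
            yield match.group("node"), match.group("purpose"), seconds


sample = (
    "DEBUG ironic.conductor.task_manager [-] Successfully released "
    "exclusive lock for provision action manage on node "
    "d7e2aed8-50a9-4427-baaa-f8f595e2ceb3 (lock was held 60.02 sec) "
    "release_resources"
)
for node, purpose, seconds in long_held_locks([sample]):
    print(node, purpose, seconds)
```

To process a filtered log such as ``state.log``, pass the open file object as ``lines``. A lock that is never released produces no such message, so a missing entry is itself a sign of a stuck lock.
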
Unfortunately, due to the way the conductor is designed, it is not possible to
gracefully break a stuck lock held in ``*-ing`` states. As a last resort, you
may need to restart the affected conductor. See `Why are my nodes stuck in a
"-ing" state?`_.