Troubleshooting a Failed Overcloud Deployment
---------------------------------------------
If an Overcloud deployment has failed, the OpenStack clients and service log
files can be used to troubleshoot the failed deployment. The following commands
are all run on the Undercloud and assume a stackrc file has been sourced.
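
For example, assuming the file is in the ``stack`` user's home directory on the
undercloud (the usual location)::

    $ source ~/stackrc
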
Identifying Failed Component
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In most cases, Heat will show the failed overcloud stack when a deployment
has failed::
$ heat stack-list
+--------------------------------------+------------+--------------------+----------------------+
| id | stack_name | stack_status | creation_time |
+--------------------------------------+------------+--------------------+----------------------+
| 7e88af95-535c-4a55-b78d-2c3d9850d854 | overcloud | CREATE_FAILED | 2015-04-06T17:57:16Z |
+--------------------------------------+------------+--------------------+----------------------+
Occasionally, Heat is not even able to create the stack, so the ``heat
stack-list`` output will be empty. If this is the case, observe the message
that was printed to the terminal when ``openstack overcloud deploy`` or ``heat
stack-create`` was run.
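
If the stack does exist but has failed, a quick way to see the top-level
failure reason is to show the stack and look at its ``stack_status_reason``
field (the ``grep`` here is only a convenience)::

    $ heat stack-show overcloud | grep stack_status_reason
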
Next, there are a few layers on which the deployment can fail:
* Orchestration (Heat and Nova services)
* Bare metal provisioning (Ironic service)
* Post-deploy configuration (Puppet)
As the Ironic service sits in the middle layer, you can use its shell output to
determine which layer failed. Issue the ``ironic node-list`` command to see all
registered nodes and their current status; you will see something like::
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
| UUID | Name | Instance UUID | Power State | Provision State | Maintenance |
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
| f1e26112-5fbd-4fc4-9612-ecce7a1d86aa | None | None | power off | available | False |
| f0b8c105-f1d7-4059-a9a3-b050c3340340 | None | None | power off | available | False |
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
Pay close attention to the **Provision State** and **Maintenance** columns
in the resulting table.
* If the command shows an empty table, or fewer nodes than you expect, or
**Maintenance** is ``True``, or **Provision State** is ``manageable``,
there was a problem during node enrollment and introspection.
Please go back to those steps.
For example, **Maintenance** is set to ``True`` automatically if wrong power
credentials are provided (a recovery sketch follows this list).
* If **Provision State** is ``available``, then the problem occurred before
bare metal deployment even started. Proceed with `Debugging Using Heat`_.
* If **Provision State** is ``active`` and **Power State** is ``power on``,
then bare metal deployment has finished successfully, and the problem happened
during the post-deployment configuration step. Again, refer to `Debugging
Using Heat`_.
* If **Provision State** is ``wait call-back``, then bare metal deployment is
not finished for this node yet. You may want to wait until the status
changes.
* If **Provision State** is ``error`` or ``deploy failed``, then bare metal
deployment has failed for this node. Issue::

    ironic node-show <UUID>

and look at the **last_error** field; it contains the error description.
If the error message is vague, you can use the logs to clarify it::

    sudo journalctl -u openstack-ironic-conductor -u openstack-ironic-api
If you see a wait timeout error and the node's **Power State** is ``power on``,
try to connect to the virtual console of the failed machine. Use the
``virt-manager`` tool for virtual machines and a vendor-specific virtual
console (e.g. iDRAC for Dell) for bare metal machines.
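
As a concrete sketch of the recovery path for the wrong power credentials case
above (the UUID and password are placeholders, and the exact ``driver_info``
key depends on the power driver in use), you can inspect the node, correct the
credentials, and take the node out of maintenance so it can be deployed again::

    $ ironic node-show <UUID>          # check the last_error field
    $ ironic node-update <UUID> replace driver_info/ipmi_password=<correct-password>
    $ ironic node-set-maintenance <UUID> false
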
Debugging Using Heat
^^^^^^^^^^^^^^^^^^^^
* Identifying the failed Heat resource
List all the stack resources to see which one failed.
::
$ heat resource-list overcloud
+-----------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+-----------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+
| BlockStorage | 9e40a1ee-96d3-4920-868d-683d3788e129 | OS::Heat::ResourceGroup | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| BlockStorageAllNodesDeployment | 2c453f6b-7378-44c8-a0ad-57de57d9c57f | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| BlockStorageNodesPostDeployment | | OS::TripleO::BlockStoragePostDeployment | INIT_COMPLETE | 2015-04-06T21:15:20Z |
| CephClusterConfig | 1684e7a3-0e42-44fe-9db4-7543b742fbfc | OS::TripleO::CephClusterConfig::SoftwareConfig | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| CephStorage | 48b3460c-bf9a-4663-99fc-2b4fa01b8dc1 | OS::Heat::ResourceGroup | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| CephStorageAllNodesDeployment | 76beb3a9-8327-4d2e-a206-efe12f1613fb | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| CephStorageCephDeployment | af8fb02a-5bc6-468c-8fac-fbe7e5b2c689 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| CephStorageNodesPostDeployment | | OS::TripleO::CephStoragePostDeployment | INIT_COMPLETE | 2015-04-06T21:15:20Z |
| Compute | e5e6ec84-197f-4bf6-b8ac-eb11fe494cdf | OS::Heat::ResourceGroup | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ComputeAllNodesDeployment | e6d44fbf-9683-4765-acbb-4a3d31c8fd48 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerNodesPostDeployment | e551e472-f2db-4468-b586-0374678d71a3 | OS::TripleO::ControllerPostDeployment | CREATE_FAILED | 2015-04-06T21:15:20Z |
| ComputeCephDeployment | 673608d5-70d7-453a-ac78-7987bc2c0158 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ComputeNodesPostDeployment | 1078e3e3-9f6f-48b9-8961-a30f44098856 | OS::TripleO::ComputePostDeployment | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControlVirtualIP | 6402b396-84aa-4cf6-9849-305205755604 | OS::Neutron::Port | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| Controller | ffc45352-9708-486d-81ac-3b60efa8e8b8 | OS::Heat::ResourceGroup | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerAllNodesDeployment | f73c6e33-3dd2-46f1-9eca-0d2981a4a986 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerBootstrapNodeConfig | 01ce5b6a-794a-4828-bad9-49d5fbfd55bf | OS::TripleO::BootstrapNode::SoftwareConfig | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerBootstrapNodeDeployment | c963d53d-879b-4a41-a10a-9000ac9f02a1 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerCephDeployment | 2d4281df-31ea-4433-820d-984a6dca6eb1 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerClusterConfig | 719c0d30-a4b8-4f77-9ab6-b3c9759abeb3 | OS::Heat::StructuredConfig | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerClusterDeployment | d929aa40-1b73-429e-81d5-aaf966fa6756 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ControllerSwiftDeployment | cf28f9fe-025d-4eed-b3e5-3a5284a2aa60 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| HeatAuthEncryptionKey | overcloud-HeatAuthEncryptionKey-5uw6wo7kavnq | OS::Heat::RandomString | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| MysqlClusterUniquePart | overcloud-MysqlClusterUniquePart-vazyj2s4n2o5 | OS::Heat::RandomString | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| MysqlRootPassword | overcloud-MysqlRootPassword-nek2iky7zfdm | OS::Heat::RandomString | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ObjectStorage | 47327c98-533e-4cc2-b1f3-d8d0eedba822 | OS::Heat::ResourceGroup | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ObjectStorageAllNodesDeployment | 7bb691aa-fa93-4f10-833e-6edeccc61408 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ObjectStorageNodesPostDeployment | d4d16f39-384a-4d6a-9719-1dd9b2d4ff09 | OS::TripleO::ObjectStoragePostDeployment | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| ObjectStorageSwiftDeployment | afc87385-8b40-4097-b529-2a5bc81c94c8 | OS::Heat::StructuredDeployments | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| PublicVirtualIP | 4dd92878-8f29-49d8-9d3d-bc0cd44d26a9 | OS::Neutron::Port | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| RabbitCookie | overcloud-RabbitCookie-uthzbos3l66v | OS::Heat::RandomString | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| SwiftDevicesAndProxyConfig | e2141170-bb77-4509-b8bd-58447b2cd15f | OS::TripleO::SwiftDevicesAndProxy::SoftwareConfig | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
| allNodesConfig | cbd42692-fffa-4527-a519-bd4014ebf0fb | OS::TripleO::AllNodes::SoftwareConfig | CREATE_COMPLETE | 2015-04-06T21:15:20Z |
+-----------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+
In this example, notice how the **ControllerNodesPostDeployment** resource
has failed. The **\*PostDeployment** resources represent the configuration that
is applied to the deployed Overcloud nodes. When these resources have failed, it
indicates that something went wrong during the Overcloud node configuration,
perhaps when Puppet was run.
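
Because many TripleO resources are nested stacks, the resource that actually
failed is sometimes buried one or more levels down. Listing resources
recursively and filtering for failures can help; for example (the nested depth
of 5 is an arbitrary choice)::

    $ heat resource-list --nested-depth 5 overcloud | grep FAILED
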
* Show the failed resource
::
$ heat resource-show overcloud ControllerNodesPostDeployment
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {} |
| description | |
| links | http://192.168.24.1:8004/v1/cea2a0c78d2447bc9a0f7caa35c9224c/stacks/overcloud/ec3e3251-f949-4df9-92be-dbd37c6992a1/resources/ControllerNodesPostDeployment (self) |
| | http://192.168.24.1:8004/v1/cea2a0c78d2447bc9a0f7caa35c9224c/stacks/overcloud/ec3e3251-f949-4df9-92be-dbd37c6992a1 (stack) |
| | http://192.168.24.1:8004/v1/cea2a0c78d2447bc9a0f7caa35c9224c/stacks/overcloud-ControllerNodesPostDeployment-6kcqm5zuymqu/e551e472-f2db-4468-b586-0374678d71a3 (nested) |
| logical_resource_id | ControllerNodesPostDeployment |
| physical_resource_id | e551e472-f2db-4468-b586-0374678d71a3 |
| required_by | BlockStorageNodesPostDeployment |
| | CephStorageNodesPostDeployment |
| resource_name | ControllerNodesPostDeployment |
| resource_status | CREATE_FAILED |
| resource_status_reason | ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "None" |
| resource_type | OS::TripleO::ControllerPostDeployment |
| updated_time | 2015-04-06T21:15:20Z |
+------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The ``resource-show`` output doesn't always show a clear reason why the
resource failed. In these cases, logging in to the Overcloud node is required
to troubleshoot the issue further.
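
Before logging in, it is sometimes possible to get the configuration output
directly from Heat by showing the failed software deployment; the
``deploy_stdout`` and ``deploy_stderr`` fields in its output often contain the
Puppet error (the deployment ID below is a placeholder taken from
``heat deployment-list``)::

    $ heat deployment-list
    $ heat deployment-show <deployment-id>
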
* Logging into Overcloud nodes
Use the nova client to see the IP addresses of the Overcloud nodes.
::
$ nova list
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------+
| 18014b02-b143-4ca2-aeb9-5553bec93cff | ov-4tvbtgpv7w-0-soqocxy2w4fr-NovaCompute-nlrxd3lgmmlt | ACTIVE | - | Running | ctlplane=192.168.24.13 |
| 96a57a46-1e48-4c66-adaa-342ee4e98972 | ov-rf4hby6sblk-0-iso3zlqmyzfe-Controller-xm2imjkzalhi | ACTIVE | - | Running | ctlplane=192.168.24.14 |
+--------------------------------------+-------------------------------------------------------+--------+------------+-------------+---------------------+
Log in as the ``heat-admin`` user to one of the deployed nodes. In this
example, since the **ControllerNodesPostDeployment** resource failed, log in
to the controller node. The ``heat-admin`` user has sudo access.
::
$ ssh heat-admin@192.168.24.14
While logged in to the controller node, examine the ``os-collect-config`` log
for a possible reason for the failure.
::
$ sudo journalctl -u os-collect-config
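
The ``os-collect-config`` log can be long; filtering it for errors is often a
quick way to find the failing step (the pattern below is only a starting
point)::

    $ sudo journalctl -u os-collect-config | grep -iE 'error|fail'
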
* Failed Nova Server ResourceGroup Deployments
In some cases, Nova fails to deploy the node entirely. This situation is
indicated by a failed ``OS::Heat::ResourceGroup`` for one of the
Overcloud role types, such as Controller or Compute.
Use nova to see the failure in this case.
::
$ nova list
$ nova show <server-id>
The most common error will reference the message ``No valid host
was found``. Refer to `No Valid Host Found Error`_ below.
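
For a server in ``ERROR`` state, the ``fault`` field in the ``nova show``
output usually contains the scheduler or build error; for example (the server
ID is a placeholder)::

    $ nova show <server-id> | grep -A2 fault
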
In other cases, look at the following log files for further troubleshooting::
/var/log/nova/*
/var/log/heat/*
/var/log/ironic/*
* Using SOS
SOS is a set of tools that gathers information about system hardware and
configuration. The information can then be used for diagnostic purposes and
debugging. SOS is commonly used to help support technicians and developers.
SOS is useful on both the undercloud and overcloud. Install the ``sos``
package and then generate a report::
$ sudo sosreport --all-logs
.. _no-valid-host:
No Valid Host Found Error
^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes ``/var/log/nova/nova-conductor.log`` contains the following error::
NoValidHost: No valid host was found. There are not enough hosts available.
"No valid host was found" means that the Nova Scheduler could not find a bare
metal node suitable for booting the new instance.
This in turn usually means there is a mismatch between the resources that Nova
expects to find and the resources that Ironic has advertised to Nova.
Start by checking the `Ironic troubleshooting guide on this topic
<http://docs.openstack.org/developer/ironic/deploy/troubleshooting.html#nova-returns-no-valid-host-was-found-error>`_.
If you're using advanced profile matching with multiple flavors, make sure
you have enough nodes corresponding to each flavor/profile. Check the
``capabilities`` key in the ``properties`` field of the ``ironic node-show``
output; it should contain, for example, ``profile:compute``.
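
For example, to compare what Nova expects with what Ironic advertises, you can
look at the flavor's extra specs and the node's capabilities side by side (the
flavor name and node UUID below are placeholders)::

    $ openstack flavor show compute
    $ ironic node-show <UUID>

The ``capabilities:profile`` extra spec on the flavor should match the
``profile`` capability on enough available nodes.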