Merge "Long Latency Between System Controller and Subclouds (r9, dsr8MR3)"
commit 5cb9e78a32

@@ -0,0 +1,2 @@
.. licenseexpirationalarm-begin
.. licenseexpirationalarm-end

@@ -58,6 +58,7 @@ Operation
 delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
 restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
 rehoming-a-subcloud
+rehoming-subcloud-with-expired-certificates-00549c4ea6e2
 rename-subcloud-e303565e7192
 prestage-a-subcloud-using-dcmanager-df756866163f
 add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9

@@ -0,0 +1,205 @@
.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:

===========================================
Rehoming Subcloud with Expired Certificates
===========================================

The rehoming procedure for a subcloud that has been powered off for a long
period differs from the regular rehoming procedure. Depending on how long the
subcloud has been offline, its platform certificates may have expired and need
to be regenerated.

If the certificates are recoverable, the rehoming playbook recovers most of
them automatically. Some certificates, however, require manual intervention.
In that case the playbook fails, and
:command:`dcmanager subcloud errors <subcloud-name>` indicates the actions
that need to be taken.
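
Each failure described on this page is reported with a fixed message string,
so the output of :command:`dcmanager subcloud errors` can be matched against
those strings to decide what to do next. The following is a minimal sketch
under that assumption; ``rehoming_next_action`` is a hypothetical helper, not
part of ``dcmanager``, and the action texts are paraphrases of the sections
below.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of dcmanager): read the output of
   # "dcmanager subcloud errors <subcloud-name>" on stdin and print the
   # follow-up action that this page documents for the matching message.
   rehoming_next_action() {
       local errors
       errors="$(cat)"
       case "$errors" in
           *"Please power-off controller-1"*)
               echo "power off controller-1 and retry" ;;
           *"certificates are expired. Manual action required"*)
               echo "install new certificates with system certificate-install and retry" ;;
           *"Cert-manager certificate(s) with their issuer expired"*)
               echo "update the Issuer secret named in the message and retry" ;;
           *"management affecting alarms"*)
               echo "resolve the alarms listed by fm alarm-list --mgmt_affecting and retry" ;;
           *)
               echo "unrecognized: read the full error text" ;;
       esac
   }

   # Example, run on the system controller:
   #   dcmanager subcloud errors subcloud3 | rehoming_next_action

The helper only classifies the message. The full error text remains the
authoritative description of what went wrong.
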

.. rubric:: |proc|

#. Power on controller-0 of the subcloud.

   .. note::

      Ensure that you can ping the |OAM| floating IP from the new system
      controller before proceeding.

#. SSH to the subcloud as ``sysadmin``. If the password has expired, you are
   prompted to update the ``sysadmin`` password.

#. Proceed with rehoming.
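
The reachability check in the note above can be scripted so that rehoming is
attempted only once the subcloud answers. This is a minimal sketch, not part
of the StarlingX tooling: ``wait_for_oam``, ``PING_CMD`` and ``RETRY_DELAY``
are hypothetical names, and ``192.0.2.10`` stands in for the |OAM| floating IP
of the subcloud.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper: ping the subcloud OAM floating IP until it answers
   # or the attempts run out. PING_CMD and RETRY_DELAY exist only so the
   # function can be dry-run; they are not StarlingX settings.
   wait_for_oam() {
       local oam_ip="$1" attempts="${2:-5}" i
       for ((i = 1; i <= attempts; i++)); do
           if ${PING_CMD:-ping} -c 1 -W 2 "$oam_ip" >/dev/null 2>&1; then
               echo "reachable: $oam_ip"
               return 0
           fi
           # Pause between attempts, but not after the last one.
           if [ "$i" -lt "$attempts" ]; then
               sleep "${RETRY_DELAY:-5}"
           fi
       done
       echo "unreachable: $oam_ip" >&2
       return 1
   }

   # Example, run on the new system controller:
   #   wait_for_oam 192.0.2.10 10 && echo "safe to start rehoming"
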

-----------------
Multi-node System
-----------------

In DX and standard subclouds (subclouds with two controllers), the subcloud
may be in a continuous swact/reboot cycle, which leaves the rehoming procedure
without a stable controller to target.

- Ensure that you power off controller-1 before attempting the rehoming
  procedure. Otherwise, the playbook fails with the error ``Certificate
  recovery in progress. Please power-off controller-1 and try again``.

- The rehoming playbook runs and recovers the active controller of the
  subcloud, after which it displays ``Running certificate recovery on other
  nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow
  the logs.`` This means that another Ansible process is running on the
  subcloud, and you can review the log for more details.

- At the ``Running certificate recovery on other nodes`` step, controller-1
  should be powered on automatically. If it is not, a message is written to
  ``/root/ansible.log`` asking for manual intervention to power it on.

The following error indicates that controller-1 must be powered off before the
certificates of the subcloud's active controller can be recovered:

.. code-block::

   [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
   FAILED rehoming playbook of (subcloud3).
   detail: fatal: [subcloud3]: FAILED! => changed=false
   msg: Certificate recovery in progress. Please power-off controller-1 and try again.
   FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running] Thursday 15 March 2035 00:01:03 +0000 (0:00:00.439) 0:00:08.467

If you get this error, power off controller-1 and try again.

-----------------------------
Manually Managed Certificates
-----------------------------

Manually managed certificates are those that the user installs with the
:command:`system certificate-install` command. Examples include the StarlingX
REST API and Horizon server certificate and the local registry server
certificate. These certificates cannot be recovered automatically.

Because automatic recovery is not possible, the rehoming procedure fails and
asks for manual intervention:

.. code-block::

   [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
   FAILED rehoming playbook of (subcloud3).
   detail: fatal: [subcloud3]: FAILED! => changed=false
   msg: |-
   Rest API and Docker Registry certificates are expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
   Wednesday 14 March 2035 22:52:22 +0000 (0:00:00.026) 0:03:12.115 *******
   skipping: [subcloud3]

If you get this error, generate new certificates to replace the expired ones
named in the message, install them with
:command:`system certificate-install`, and try again.

.. note::

   This is not required if the certificates are already managed by
   cert-manager.

--------------------------------------------------
Cert-manager Certificates using a Custom CA Issuer
--------------------------------------------------

If you are using a cert-manager Issuer other than ``system-local-ca`` for
platform certificates, you will get the following error:

.. code-block::

   [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
   FAILED rehoming playbook of (subcloud1).
   detail: fatal: [subcloud1]: FAILED! => changed=false
   msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
   deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
   again."
   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
   Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799 ********
   skipping: [subcloud1]
   FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042)
   0:02:42.799

In this case, you must manually update the secret of the underlying Issuer.

In the example above, the error mentions
``deployment/cloudplatform-rootca-secret``, where ``deployment`` is the
Kubernetes namespace and ``cloudplatform-rootca-secret`` is the secret name.
To update the |CA| certificate in this secret, use the following commands:

.. code-block::

   kubectl -n deployment delete secret cloudplatform-rootca-secret
   kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
   rm ca.crt ca.key

``ca.crt`` and ``ca.key`` are in PEM format. They can be obtained from your
security personnel or from the team responsible for certificate management.
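
Because the existing secret is deleted before the new one is created, it may
be worth checking the new files first. The following is a minimal sketch,
assuming ``openssl`` is available on the host. ``ca_cert_is_current`` and
``ca_pair_matches`` are hypothetical helper names, not StarlingX tooling. The
first confirms that the certificate has not itself expired; the second
confirms that the certificate and the key belong together.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helpers (not StarlingX tooling): sanity-check a CA
   # certificate and key before loading them into the Issuer secret.

   # Succeeds only if the certificate parses and has not expired yet.
   ca_cert_is_current() {
       openssl x509 -in "$1" -noout -checkend 0 >/dev/null 2>&1
   }

   # Succeeds only if the certificate and the private key carry the same
   # public key.
   ca_pair_matches() {
       local cert_pub key_pub
       cert_pub="$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null)" || return 1
       key_pub="$(openssl pkey -in "$2" -pubout 2>/dev/null)" || return 1
       [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
   }

   # Example, run in the directory that holds the new files:
   #   ca_cert_is_current ./ca.crt && ca_pair_matches ./ca.crt ./ca.key \
   #       && echo "ca.crt and ca.key look usable"
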

---------------------------
Management Affecting Alarms
---------------------------

Once the certificate recovery process is complete, the subcloud should be free
of management affecting alarms. Any remaining management affecting alarm
causes the rehoming procedure to fail. The subcloud may still be recoverable;
the alarms indicate the condition and provide information on the next step.

.. code-block::

   [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
   FAILED rehoming playbook of (subcloud3).
   detail: fatal: [subcloud3]: FAILED! => changed=false
   msg: The subcloud has management affecting alarms which are blocking the rehoming
   procedure from continuing. The subcloud may still be recoverable, connect to it and
   run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
   condition(s) then try again.
   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
   Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295 *******
   skipping: [subcloud3]
   FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
   after use in compute nodes] Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020)
   0:42:53.295

In this case, review the active alarms and take the necessary actions to resolve them.
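
To confirm that the alarms have cleared before retrying, the table printed by
the command named in the error message can be counted. The following is a
minimal sketch. ``count_alarm_rows`` is a hypothetical helper, not part of the
``fm`` CLI, and it assumes that every alarm row starts with a numeric alarm ID
such as ``500.210``, as in the alarm table shown on this page.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of the fm CLI): count the alarm rows in
   # the table read from stdin. A data row starts with an alarm ID such as
   # 500.210; header, border and wrapped continuation rows do not.
   count_alarm_rows() {
       grep -cE '^\| *[0-9]+\.[0-9]+ *\|' || true
   }

   # Example, run on the subcloud:
   #   remaining="$(fm alarm-list --mgmt_affecting | count_alarm_rows)"
   #   [ "$remaining" -eq 0 ] && echo "no management affecting alarms left"
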

.. only:: partner

   .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
      :start-after: licenseexpirationalarm-begin
      :end-before: licenseexpirationalarm-end

-------------------
SSL CA Certificates
-------------------

SSL CA certificates are not automatically recovered as part of the rehoming
procedure.

After a successful rehoming, the system raises an alarm to notify users that
SSL CA certificates have expired:

.. code-block::

   [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
   | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
   | 500.210 | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired. | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
   | | | 9062a088-8c71-46c6-b194-6a65908f1080 | | .917781 |
   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+

The alarm indicates that the certificate has expired. For more information
about the certificate, run ``sudo show-certs.sh``. There are two possible
resolutions:

- The certificate is no longer needed. Uninstall it:

  .. code-block::

     system certificate-list | grep ssl_ca
     system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

- The certificate is still needed. Uninstall the expired certificate:

  .. code-block::

     system certificate-list | grep ssl_ca
     system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

  Then obtain and install the new version of the required certificate:

  .. code-block::

     system certificate-install -m ssl_ca <new_ssl_ca>
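
The UUID needed by :command:`system certificate-uninstall` also appears in the
reason text of the expiry alarm, so it can be pulled from the alarm list
instead of being copied by hand. The following is a minimal sketch.
``expired_ssl_ca_uuids`` is a hypothetical helper, and it assumes the reason
text is printed on a single line, as in the alarm table above.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of the fm or system CLIs): print the UUID
   # of every ssl_ca certificate that an alarm row on stdin reports as
   # expired.
   expired_ssl_ca_uuids() {
       grep -E '\(mode=ssl_ca\) expired' |
           grep -oE 'certificate-show [0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' |
           awk '{ print $2 }'
   }

   # Example, run on the subcloud. Review the list before uninstalling:
   #   fm alarm-list | expired_ssl_ca_uuids
   #   system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
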