Long Latency Between System Controller and Subclouds (r9, dsr8MR3)
Added a new section for rehoming subcloud with expired certificates

Story: 2010815
Task: 49748

Change-Id: Icb523fc50ada181d44caab46dcd7e9b30e0bc32c
Signed-off-by: Ngairangbam Mili <ngairangbam.mili@windriver.com>
parent ecef035a3a
commit 4d5177f95f
@ -0,0 +1,2 @@
.. licenseexpirationalarm-begin
.. licenseexpirationalarm-end
@ -58,6 +58,7 @@ Operation
   delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
   restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
   rehoming-a-subcloud
   rehoming-subcloud-with-expired-certificates-00549c4ea6e2
   rename-subcloud-e303565e7192
   prestage-a-subcloud-using-dcmanager-df756866163f
   add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9
@ -0,0 +1,205 @@
.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:

===========================================
Rehoming Subcloud with Expired Certificates
===========================================

The rehoming procedure for a subcloud that has been powered off for a long
period of time differs from the regular rehoming procedure. Depending on how
long the subcloud has been offline, the platform certificates may have expired
and require regeneration.

If the certificates are recoverable, the rehoming playbook automatically
recovers most of them. However, some certificates require manual
intervention. In that case, the playbook fails and
:command:`dcmanager subcloud errors <subcloud-name>` indicates the actions
that need to be taken.

.. rubric:: |proc|

#.  Power on controller-0 of the subcloud.

    .. note::

        Ensure that you can ping the |OAM| floating IP from the new system
        controller before proceeding.

#.  SSH to the subcloud as sysadmin. If the password has expired, you are
    prompted to update the sysadmin password.

#.  Proceed with rehoming.

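
The reachability check in the note above can be scripted. The following is a
minimal sketch, assuming Linux ``ping`` (iputils) on the new system
controller. The function name ``wait_for_oam``, its arguments, and the address
``192.0.2.10`` are hypothetical and are not part of the product tooling.

.. code-block:: bash

    # Hypothetical helper: retry a single ICMP echo until the subcloud OAM
    # floating IP answers or the attempts are exhausted.
    # Usage: wait_for_oam <oam_floating_ip> [attempts] [delay_seconds]
    wait_for_oam() {
        local ip="$1" attempts="${2:-30}" delay="${3:-10}"
        local i
        for ((i = 1; i <= attempts; i++)); do
            # -c 1: send one probe; -W 2: wait at most 2 seconds for a reply.
            if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
                echo "OAM floating IP $ip is reachable"
                return 0
            fi
            sleep "$delay"
        done
        echo "OAM floating IP $ip did not answer after $attempts attempts" >&2
        return 1
    }

For example, ``wait_for_oam 192.0.2.10 30 10 && ssh sysadmin@192.0.2.10``
would attempt the SSH login only once the address responds.
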
-----------------
Multi-node System
-----------------

In DX and standard subclouds (subclouds with two controllers), the subcloud may
be in a continuous swact/reboot cycle, which leaves the rehoming procedure
without a stable controller to target.

-   Ensure that you power off controller-1 before attempting the rehoming
    procedure. Otherwise, the playbook will fail with the error ``Certificate
    recovery in progress. Please power-off controller-1 and try again``.

-   The rehoming playbook will run and recover the active controller of the
    subcloud, after which it will display ``Running certificate recovery on
    other nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to
    follow the logs.`` This means that another Ansible process is running in
    the subcloud and you can review the log for more details.

-   At the ``Running certificate recovery on other nodes`` step, controller-1
    should be powered on automatically. If not, a message is written to
    ``/root/ansible.log`` asking for manual intervention to power it on.

The following error indicates that controller-1 must be powered off before
certificate recovery can run on the active controller of the subcloud:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
    msg: Certificate recovery in progress. Please power-off controller-1 and try again.
    FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running] Thursday 15 March 2035 00:01:03 +0000 (0:00:00.439) 0:00:08.467

If you get this error, power off controller-1 and try again.

-----------------------------
Manually Managed Certificates
-----------------------------

Manual certificates are those that the user installs manually with the
:command:`system certificate-install` command. Examples include the StarlingX
REST API & Horizon Server certificate and the Local Registry Server
certificate. Manual certificates cannot be recovered automatically.

Because automatic recovery is not possible, the rehoming procedure will fail
and ask for manual intervention:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
    msg: |-
    Rest API and Docker Registry certificates are expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035 22:52:22 +0000 (0:00:00.026) 0:03:12.115 *******
    skipping: [subcloud3]

If you get this error, generate new certificates to replace the expired ones
named in the message and install them with
:command:`system certificate-install`. Then restart the procedure with
:command:`dcmanager subcloud delete` and :command:`dcmanager subcloud add`, as
the error message instructs.

.. note::

    This is not required if the certificates are already managed by
    cert-manager.

--------------------------------------------------
Cert-manager Certificates using a Custom CA Issuer
--------------------------------------------------

If you are using a cert-manager Issuer other than ``system-local-ca`` for
platform certificates and that Issuer's certificate has expired, you will get
the following error:

.. code-block::

    [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
    FAILED rehoming playbook of (subcloud1).
    detail: fatal: [subcloud1]: FAILED! => changed=false
    msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
    deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
    again.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799 ********
    skipping: [subcloud1]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042)
    0:02:42.799

In this case, you must manually update the secret of the underlying Issuer.

For example, the above error mentions
``deployment/cloudplatform-rootca-secret``, where ``deployment`` is the
Kubernetes namespace and ``cloudplatform-rootca-secret`` is the secret name.
To update the |CA| certificate in this secret, use the following commands:

.. code-block::

    kubectl -n deployment delete secret cloudplatform-rootca-secret
    kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
    rm ca.crt ca.key

``ca.crt`` and ``ca.key`` are in PEM format. You can obtain them from your
security personnel or the team responsible for certificate management.

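
Before replacing the secret, it can help to confirm that the files you
received are usable. The sketch below assumes that the ``openssl``
command-line tool is available and that the key is not passphrase-protected.
``check_ca_pair`` is a hypothetical helper name. It checks that the
certificate has not already expired and that it matches the private key.

.. code-block:: bash

    # Hypothetical helper: validate a PEM certificate/key pair.
    # Usage: check_ca_pair <ca.crt> <ca.key>
    check_ca_pair() {
        local cert="$1" key="$2"
        local cert_pub key_pub

        # -checkend 0 exits non-zero if the certificate has already expired.
        if ! openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
            echo "certificate $cert has expired or cannot be read" >&2
            return 1
        fi

        # Compare the public key embedded in the certificate with the public
        # key derived from the private key.
        cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 1
        key_pub=$(openssl pkey -in "$key" -pubout) || return 1
        if [ "$cert_pub" != "$key_pub" ]; then
            echo "certificate $cert does not match key $key" >&2
            return 1
        fi

        echo "certificate is still valid and matches the key"
    }

Running ``check_ca_pair ./ca.crt ./ca.key`` before the :command:`kubectl`
commands above avoids replacing the secret with an unusable pair.
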
---------------------------
Management Affecting Alarms
---------------------------

Once the certificate recovery process is completed, the subcloud should be
free of management affecting alarms. If any remain, they cause the rehoming
procedure to fail. The subcloud may still be recoverable, and the alarms
indicate the condition and provide information on the next step.

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
    msg: The subcloud has management affecting alarms which are blocking the rehoming
    procedure from continuing. The subcloud may still be recoverable, connect to it and
    run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
    condition(s) then try again.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295 *******
    skipping: [subcloud3]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
    after use in compute nodes] Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020)
    0:42:53.295

In this case, review the active alarms and take the necessary actions to resolve them.

.. only:: partner

    .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
        :start-after: licenseexpirationalarm-begin
        :end-before: licenseexpirationalarm-end

-------------------
SSL CA Certificates
-------------------

SSL CA certificates are not automatically recovered as part of the rehoming
procedure.

After a successful rehoming, the system raises an alarm to notify users that
an SSL CA certificate has expired:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | Alarm ID | Reason Text                                                                                              | Entity ID                            | Severity | Time Stamp          |
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired.        | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
    |          |                                                                                                          | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+

The alarm indicates that the certificate has expired. For more information
about the certificate, run ``sudo show-certs.sh``. There are two possible
resolutions:

-   The certificate is no longer needed. Uninstall it:

    .. code-block::

        system certificate-list | grep ssl_ca
        system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

-   The certificate is still needed. Uninstall the expired certificate:

    .. code-block::

        system certificate-list | grep ssl_ca
        system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

    Obtain and install the new version of the required certificate:

    .. code-block::

        system certificate-install -m ssl_ca <new_ssl_ca>

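
The UUID required by :command:`system certificate-uninstall` also appears in
the alarm reason text. As a convenience, the sketch below extracts it from
:command:`fm alarm-list` output. ``expired_ssl_ca_uuids`` is a hypothetical
helper, and it assumes that the reason text is printed on a single line, as in
the alarm example above.

.. code-block:: bash

    # Hypothetical helper: read "fm alarm-list" output on stdin and print the
    # UUID of each expired ssl_ca certificate, one per line.
    expired_ssl_ca_uuids() {
        # Keep only the rows that report an expired ssl_ca certificate.
        grep -F '(mode=ssl_ca) expired' \
            | grep -oE 'certificate-show [0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
            | awk '{print $2}' \
            | sort -u
    }

For example, ``fm alarm-list | expired_ssl_ca_uuids`` prints the values to
pass as ``<expired_certificate_uuid>``.
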