Long Latency Between System Controller and Subclouds (r9, dsr8MR3)

Added a new section for rehoming subcloud with expired certificates

Story: 2010815
Task: 49748
Change-Id: Icb523fc50ada181d44caab46dcd7e9b30e0bc32c
Signed-off-by: Ngairangbam Mili <ngairangbam.mili@windriver.com>

2024-03-21 15:36:07 +00:00
commit 4d5177f95f
parent ecef035a3a
3 changed files with 208 additions and 0 deletions
--- a/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
+++ b/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
@@ -0,0 +1,2 @@
+.. licenseexpirationalarm-begin
+.. licenseexpirationalarm-end
--- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
+++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
@@ -58,6 +58,7 @@ Operation
    delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
    restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
    rehoming-a-subcloud
+    rehoming-subcloud-with-expired-certificates-00549c4ea6e2
    rename-subcloud-e303565e7192
    prestage-a-subcloud-using-dcmanager-df756866163f
    add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9
--- a/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
+++ b/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
@@ -0,0 +1,205 @@
+.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:
+
+===========================================
+Rehoming Subcloud with Expired Certificates
+===========================================
+
+The rehoming procedure for a subcloud that has been powered off for a long period
+of time differs from the regular rehoming procedure. Depending on how long the
+subcloud has been offline, the platform certificates may have expired and require regeneration.
+
+If the certificates are recoverable, the rehoming playbook automatically
+recovers most of them. However, some certificates require manual intervention.
+In that case the playbook fails and :command:`dcmanager subcloud errors <subcloud-name>`
+indicates the actions that need to be taken.
+
+.. rubric:: |proc|
+
+#. Power on controller-0 of the subcloud.
+
+   .. note::
+
+       Ensure that you can ping the |OAM| floating IP from the new system controller
+       before proceeding.
+
+#. SSH to the subcloud as sysadmin. If the password has expired, you are
+   prompted to update the sysadmin password.
+
+#. Proceed with rehoming.
+
+-----------------
+Multi-node system
+-----------------
+
+In DX and standard subclouds (subclouds with two controllers), the subcloud may
+be in a continuous swact/reboot cycle, which leaves no stable controller
+for the rehoming procedure to target.
+
+- Ensure that you power off controller-1 before attempting the rehoming procedure.
+  Otherwise, the playbook fails with the error ``Certificate
+  recovery in progress. Please power-off controller-1 and try again``.
+
+- The rehoming playbook runs and recovers the active controller of the
+  subcloud, after which it displays ``Running certificate recovery on other
+  nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow
+  the logs.`` This means that another Ansible process is running on the
+  subcloud, and you can review the log for more details.
+
+- At the ``Running certificate recovery on other nodes`` step, controller-1
+  should be powered on automatically. If not, a message is written to
+  ``/root/ansible.log`` asking for manual intervention to power it on.
+
+  The following error indicates that controller-1 must be powered off before
+  the certificates of the subcloud's active controller can be recovered:
+
+  .. code-block::
+
+      [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+      FAILED rehoming playbook of (subcloud3).
+       detail: fatal: [subcloud3]: FAILED! => changed=false
+        msg: Certificate recovery in progress. Please power-off controller-1 and try again.
+      FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running]  Thursday 15 March 2035  00:01:03 +0000 (0:00:00.439)       0:00:08.467
+
+  If you get this error, power off controller-1 and retry the rehoming procedure.
+
+-----------------------------
+Manually Managed Certificates
+-----------------------------
+
+Manual certificates are those that the user installs manually with the
+:command:`system certificate-install` command. Examples include the StarlingX
+REST API & Horizon Server certificate and the Local Registry Server certificate.
+Manual certificates cannot be recovered automatically.
+
+When such a certificate has expired, the rehoming procedure fails and asks
+for manual intervention:
+
+.. code-block::
+
+    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+    FAILED rehoming playbook of (subcloud3).
+     detail: fatal: [subcloud3]: FAILED! => changed=false
+      msg: |-
+        Rest API and Docker Registry certificates are expired.  Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
+    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+    Wednesday 14 March 2035  22:52:22 +0000 (0:00:00.026)       0:03:12.115 *******
+    skipping: [subcloud3]
+
+If you get this error, generate replacements for the expired certificates,
+install them with :command:`system certificate-install`, and try again.
+
+.. note::
+
+    This will not be required if the certificates are already managed by cert-manager.
+
+--------------------------------------------------
+Cert-manager Certificates using a Custom CA Issuer
+--------------------------------------------------
+
+If you are using a Cert-manager Issuer other than ``system-local-ca`` for platform
+certificates and the Issuer has expired, you will get the following error:
+
+.. code-block::
+
+    [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
+    FAILED rehoming playbook of (subcloud1).
+     detail: fatal: [subcloud1]: FAILED! => changed=false
+      msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
+    deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
+    again."
+    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+    Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)       0:02:42.799 ********
+    skipping: [subcloud1]
+    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes]  Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)
+     0:02:42.799
+
+In this case, you must manually update the secret of the underlying Issuer.
+
+As an example, the above error mentions ``deployment/cloudplatform-rootca-secret``,
+where ``deployment`` is the K8s namespace and ``cloudplatform-rootca-secret`` is the
+secret name. To update the |CA| certificate in this secret, use the following commands:
+
+.. code-block::
+
+    kubectl -n deployment delete secret cloudplatform-rootca-secret
+    kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
+    rm ca.crt ca.key
+
+``ca.crt`` and ``ca.key`` are in PEM format. They can be obtained from the
+security personnel or the team responsible for certificate management.
+
+---------------------------
+Management Affecting Alarms
+---------------------------
+
+Once the certificate recovery process is completed, the subcloud should be free of
+management affecting alarms. Any remaining management affecting alarms cause the
+rehoming procedure to fail. The subcloud may still be recoverable, and the alarms
+should indicate the condition and provide information on the next step.
+
+.. code-block::
+
+    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+    FAILED rehoming playbook of (subcloud3).
+     detail: fatal: [subcloud3]: FAILED! => changed=false
+      msg: The subcloud has management affecting alarms which are blocking the rehoming
+    procedure from continuing. The subcloud may still be recoverable, connect to it and
+    run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
+    condition(s) then try again.
+    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+    Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)       0:42:53.295 *******
+    skipping: [subcloud3]
+    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
+    after use in compute nodes]  Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)
+    0:42:53.295
+
+In this case, review the active alarms and take the necessary actions to resolve them.
+
+.. only:: partner
+
+    .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
+       :start-after: licenseexpirationalarm-begin
+       :end-before: licenseexpirationalarm-end
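
The failure messages shown in the sections above each call for a different manual step. As a rough illustration only, the following Python sketch maps the ``msg:`` text printed by ``dcmanager subcloud errors`` to the remediation this page describes. The helper name ``suggest_action`` and the ``REMEDIATIONS`` table are hypothetical and not part of StarlingX; the matched substrings are copied from the sample outputs above and may differ between releases.

```python
# Hypothetical helper, not part of StarlingX: maps the `msg:` text printed by
# `dcmanager subcloud errors <subcloud-name>` to the remediation described on
# this page. The marker substrings are copied from the sample outputs above.
REMEDIATIONS = [
    ("Please power-off controller-1",
     "Power off controller-1, then retry the rehoming procedure."),
    ("Rest API and Docker Registry certificates are expired",
     "Install new certificates with 'system certificate-install', then "
     "delete and re-add the subcloud."),
    ("Cert-manager certificate(s) with their issuer expired",
     "Update the CA certificate in the Issuer's secret, then retry."),
    ("management affecting alarms",
     "Run 'fm alarm-list --mgmt_affecting' on the subcloud, resolve the "
     "alarms, then retry."),
]


def suggest_action(error_text):
    """Return the suggested next step for a rehoming failure message."""
    # Console output wraps long messages, so collapse all whitespace first.
    normalized = " ".join(error_text.split())
    for marker, action in REMEDIATIONS:
        if marker in normalized:
            return action
    return "Unrecognized failure; inspect the full playbook log."


if __name__ == "__main__":
    sample = ("msg: Certificate recovery in progress. "
              "Please power-off controller-1 and try again.")
    print(suggest_action(sample))
```

The lookup is a plain ordered list rather than a dictionary so that the first matching marker wins, which keeps the behavior predictable if a message ever contains more than one marker.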
+
+-------------------
+SSL CA Certificates
+-------------------
+
+SSL CA certificates are not automatically recovered as part of the rehoming procedure.
+
+After a successful rehoming, the system raises an alarm for each expired
+SSL CA certificate:
+
+.. code-block::
+
+    [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+    | Alarm ID | Reason Text                                                                                              | Entity ID                            | Severity | Time Stamp          |
+    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+    | 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired.        | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
+    |          |                                                                                                          | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
+    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+
+The alarm indicates that the certificate has expired. For more information
+about the certificate, run ``sudo show-certs.sh``. The following are the two
+possible resolutions:
+
+- The certificate is no longer needed. Uninstall it:
+
+  .. code-block::
+
+      system certificate-list | grep ssl_ca
+      system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
+
+- The certificate is needed. Uninstall the expired version:
+
+  .. code-block::
+
+      system certificate-list | grep ssl_ca
+      system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
+
+  Obtain and install the new version of the required certificate:
+
+  .. code-block::
+
+      system certificate-install -m ssl_ca <new_ssl_ca>
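
The uninstall commands above need the UUID of the expired certificate, which appears in the Reason Text of the 500.210 alarm. The following Python sketch, offered as an illustration rather than a supported tool, extracts those UUIDs from captured ``fm alarm-list`` output. The function name ``expired_ssl_ca_uuids`` is hypothetical, and the pattern assumes the Reason Text is formatted exactly as in the sample alarm shown earlier.

```python
import re

# Matches the UUID inside the Reason Text of an expired ssl_ca certificate
# alarm. Assumption: the text is formatted as in the sample alarm above.
EXPIRED_SSL_CA = re.compile(
    r"Certificate 'system certificate-show ([0-9a-f-]{36})' "
    r"\(mode=ssl_ca\) expired"
)


def expired_ssl_ca_uuids(alarm_list_output):
    """Return UUIDs of expired ssl_ca certificates, in order, without repeats."""
    seen = []
    for uuid in EXPIRED_SSL_CA.findall(alarm_list_output):
        if uuid not in seen:
            seen.append(uuid)
    return seen


if __name__ == "__main__":
    sample = (
        "| 500.210  | Certificate 'system certificate-show "
        "9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired. |"
    )
    for uuid in expired_ssl_ca_uuids(sample):
        # Print the uninstall command from this page for each expired UUID.
        print("system certificate-uninstall -m ssl_ca " + uuid)
```

The sketch only prints the commands; running them remains a manual decision, since an expired certificate may still need to be replaced rather than simply removed.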