Merge "Long Latency Between System Controller and Subclouds (r9, dsr8MR3)"

2024-04-02 23:47:41 +00:00
commit 5cb9e78a32
parent 7679e2de57 4d5177f95f
3 changed files with 208 additions and 0 deletions
--- a/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
+++ b/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
@ -0,0 +1,2 @@
 .. licenseexpirationalarm-begin
 .. licenseexpirationalarm-end
--- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
+++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
@ -58,6 +58,7 @@ Operation
    delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
    restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
    rehoming-a-subcloud
    rehoming-subcloud-with-expired-certificates-00549c4ea6e2
    rename-subcloud-e303565e7192
    prestage-a-subcloud-using-dcmanager-df756866163f
    add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9
--- a/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
+++ b/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
@ -0,0 +1,205 @@
 .. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:
 ===========================================
 Rehoming Subcloud with Expired Certificates
 ===========================================
 The rehoming procedure for a subcloud that has been powered off for a long period of
 time differs from the regular rehoming procedure. Depending on how long the
 subcloud has been offline, the platform certificates may have expired and require regeneration.
 If the certificates are recoverable, the rehoming playbook will automatically
 recover most of them. However, some certificates require manual
 intervention. When that happens, the playbook fails and
 :command:`dcmanager subcloud errors <subcloud-name>` indicates the actions that need to be taken.
 .. rubric:: |proc|
 #. Power on controller-0 of the subcloud.
   .. note::
       Ensure that you can ping the |OAM| floating IP from the new system controller
       before proceeding.
 #. SSH to the subcloud as sysadmin. If the password has expired, you will be
   prompted to update the sysadmin password.
 #. Proceed with rehoming.
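The reachability check from the note in step 1 can be scripted before you start the rehoming. The following is a minimal sketch, not part of the product: the function name ``wait_for_oam`` and the variables ``PING_BIN`` and ``RETRY_DELAY`` are hypothetical, and the ``-c`` and ``-W`` options assume the Linux iputils ``ping``.

```shell
# Hypothetical helper (not a StarlingX command): returns 0 as soon as the
# address answers one ping, 1 after max_tries failed attempts.
# PING_BIN and RETRY_DELAY are illustrative overrides for dry runs.
wait_for_oam() {
    oam_ip="$1"
    max_tries="${2:-5}"
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # One echo request, two-second timeout (iputils ping options).
        if "${PING_BIN:-ping}" -c 1 -W 2 "$oam_ip" >/dev/null 2>&1; then
            echo "reachable: $oam_ip"
            return 0
        fi
        try=$((try + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    echo "unreachable: $oam_ip" >&2
    return 1
}
```

Run it from the new system controller as ``wait_for_oam <oam_floating_ip>`` and continue with the SSH step only when it reports the address as reachable.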
 -----------------
 Multi-node system
 -----------------
 In DX and standard subclouds (subclouds with two controllers), the subcloud may be
 in a continuous swact/reboot cycle, which leaves the rehoming procedure without
 a stable controller to target.
 - Ensure that you power off controller-1 before attempting the rehoming procedure.
  Otherwise, the playbook will fail with the error ``Certificate
  recovery in progress. Please power-off controller-1 and try again``.
 - The rehoming playbook will run and recover the active controller of the
  subcloud, after which it will display ``Running certificate recovery on other
  nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow
  the logs.`` This means that another Ansible process is running on the
  subcloud; review the log for more details.
 - At the ``Running certificate recovery on other nodes`` step, controller-1
  should be powered on automatically. If it is not, a message is written to
  ``/root/ansible.log`` asking for manual intervention to power it on.
  The following error indicates that controller-1 must be powered off before
  certificates can be recovered on the subcloud's active controller:
  .. code-block::
      [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
      FAILED rehoming playbook of (subcloud3).
       detail: fatal: [subcloud3]: FAILED! => changed=false
        msg: Certificate recovery in progress. Please power-off controller-1 and try again.
      FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running]  Thursday 15 March 2035  00:01:03 +0000 (0:00:00.439)       0:00:08.467
  If you get this error, turn off controller-1 and try again.
 -----------------------------
 Manually Managed Certificates
 -----------------------------
 Manually managed certificates are those that the user installs with the
 :command:`system certificate-install` command. Examples include the StarlingX
 REST API & Horizon Server certificate and the Local Registry Server certificate.
 These certificates cannot be recovered automatically, so the rehoming procedure
 will fail and ask for manual intervention:
 .. code-block::
    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
     detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: |-
        Rest API and Docker Registry certificates are expired.  Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035  22:52:22 +0000 (0:00:00.026)       0:03:12.115 *******
    skipping: [subcloud3]
 If you get this error, generate replacements for the expired certificates,
 install them with :command:`system certificate-install`, and try again.
 .. note::
    This will not be required if the certificates are already managed by cert-manager.
 --------------------------------------------------
 Cert-manager Certificates using a Custom CA Issuer
 --------------------------------------------------
 If you are using a Cert-manager Issuer other than ``system-local-ca`` for platform
 certificates, you will get the following error:
 .. code-block::
    [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
    FAILED rehoming playbook of (subcloud1).
     detail: fatal: [subcloud1]: FAILED! => changed=false
      msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
    deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
    again."
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)       0:02:42.799 ********
    skipping: [subcloud1]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes]  Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)
     0:02:42.799
 In this case, you must manually update the secret of the underlying Issuer.
 For example, the error above mentions ``deployment/cloudplatform-rootca-secret``,
  where ``deployment`` is the Kubernetes namespace and ``cloudplatform-rootca-secret`` is the secret name.
  To update the |CA| certificate in this secret, use the following commands:
  .. code-block::
      kubectl -n deployment delete secret cloudplatform-rootca-secret
      kubectl -n deployment create secret tls cloudplatform-rootca-secret  --key=./ca.key --cert=./ca.crt
      rm ca.crt ca.key
  ``ca.crt`` and ``ca.key`` are in PEM format. Obtain them from your
  security personnel or the team responsible for certificate management.
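Before replacing the secret, it can help to confirm that the new |CA| certificate is still valid and that the key belongs to it. The following is a minimal sketch using standard ``openssl`` subcommands. The function name ``check_ca_pair`` is hypothetical, and the sketch assumes an unencrypted PEM private key.

```shell
# Hypothetical pre-check (not a StarlingX command): succeeds only if the
# certificate has not expired and its public key matches the private key.
# Assumes both files are PEM and the key is not passphrase-protected.
check_ca_pair() {
    crt="$1"
    key="$2"
    # -checkend 0 exits non-zero when the certificate has already expired.
    if ! openssl x509 -in "$crt" -noout -checkend 0 >/dev/null; then
        echo "expired: $crt" >&2
        return 1
    fi
    # Extract the public key from each file and compare the PEM output.
    crt_pub="$(openssl x509 -in "$crt" -noout -pubkey)" || return 1
    key_pub="$(openssl pkey -in "$key" -pubout)" || return 1
    if [ "$crt_pub" != "$key_pub" ]; then
        echo "mismatch: $key does not belong to $crt" >&2
        return 1
    fi
    echo "ok: $crt"
}
```

For example, run ``check_ca_pair ./ca.crt ./ca.key`` and proceed with the ``kubectl`` commands only if it prints ``ok``.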
 ---------------------------
 Management Affecting Alarms
 ---------------------------
 Once the certificate recovery process is complete, the subcloud should be free of
 management-affecting alarms. Any management-affecting alarm that remains will cause
 the rehoming procedure to fail. The subcloud may still be recoverable; the alarms
 indicate the condition and provide information on the next step.
 .. code-block::
    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
     detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: The subcloud has management affecting alarms which are blocking the rehoming
    procedure from continuing. The subcloud may still be recoverable, connect to it and
    run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
    condition(s) then try again.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)       0:42:53.295 *******
    skipping: [subcloud3]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
    after use in compute nodes]  Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)
    0:42:53.295
 In this case, review the active alarms and take the necessary actions to resolve them.
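The rehoming failures described on this page can be told apart by the ``msg`` text that :command:`dcmanager subcloud errors` prints. The following is a minimal sketch of that triage. The function name ``classify_rehoming_error`` and the action labels it prints are hypothetical; only the matched message fragments come from the sample outputs on this page.

```shell
# Hypothetical triage helper (not a StarlingX command). Reads the output of
# `dcmanager subcloud errors <subcloud>` on stdin and prints a label for the
# follow-up action described on this page. The labels are illustrative.
classify_rehoming_error() {
    # Collapse line wraps so a message split across lines still matches.
    err="$(tr -s '[:space:]' ' ')"
    case "$err" in
        *"Please power-off controller-1"*)
            echo "power-off-controller-1" ;;
        *"Rest API and Docker Registry certificates are expired"*)
            echo "reinstall-manual-certificates" ;;
        *"Cert-manager certificate(s) with their issuer expired"*)
            echo "update-issuer-secret" ;;
        *"management affecting alarms"*)
            echo "resolve-alarms" ;;
        *)
            echo "unknown" ;;
    esac
}
```

A possible invocation on the system controller is ``dcmanager subcloud errors subcloud3 | classify_rehoming_error``.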
 .. only:: partner
    .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
       :start-after: licenseexpirationalarm-begin
       :end-before: licenseexpirationalarm-end
 -------------------
 SSL CA Certificates
 -------------------
 SSL CA certificates are not automatically recovered as part of the rehoming procedure.
 After a successful rehoming, the system raises an alarm for each expired
 SSL CA certificate:
 .. code-block::
    [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | Alarm ID | Reason Text                                                                                              | Entity ID                            | Severity | Time Stamp          |
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired.        | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
    |          |                                                                                                          | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
    +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
 The alarm indicates that the certificate has expired. For more information
 about the certificate, run ``sudo show-certs.sh``. The following are the two
 possible resolutions:
 - The certificate is no longer needed
  .. code-block::
      system certificate-list | grep ssl_ca
      system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
 - The certificate is needed
  .. code-block::
      system certificate-list | grep ssl_ca
      system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
  Obtain and install the new version of the required certificate:
  .. code-block::
      system certificate-install -m ssl_ca <new_ssl_ca>
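The UUID needed by :command:`system certificate-uninstall` also appears in the reason text of the 500.210 alarm. The following is a minimal sketch that extracts it. The function name ``expired_ssl_ca_uuids`` is hypothetical, and the sketch assumes the alarm text has the format shown in the sample :command:`fm alarm-list` output above.

```shell
# Hypothetical helper (not a StarlingX command). Reads `fm alarm-list` output
# on stdin and prints the UUID of every expired ssl_ca certificate, one per
# line. Assumes the reason text format shown in the sample output.
expired_ssl_ca_uuids() {
    grep -E 'mode=ssl_ca\) expired' \
        | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
        | sort -u
}
```

For example, ``fm alarm-list | expired_ssl_ca_uuids`` lists the UUIDs, each of which can then be passed to ``system certificate-uninstall -m ssl_ca``.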