Expiring-Soon and Expired Certificate Alarms
Expired certificates may prevent the proper operation of the platform and of applications running on it. To help avoid expired certificates, the platform generates alarms for certificates that are within 30 days (default) of expiry or have already expired.
This functionality is enabled by default for all platform and user-installed certificates that are approaching their respective expiry dates. User-override options are available for customizing the alarm behavior.
The two types of certificate alarms are:
Expiring Soon
(alarm ID: 500.200, severity: major); by default raised 30 days prior to expiry of the certificate.
Expired
(alarm ID: 500.210, severity: critical).
Note
Certificates are checked every 24 hours to determine whether an Expiring-Soon or Expired alarm should be raised. As a result, alarms may not occur at precise 24-hour multiples of the times they were set.
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+------------------------------------------------------------------------------------------+--------------------------------------+----------+------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+------------------------------------------------------------------------------------------+--------------------------------------+----------+------------------+
| 500.200 | Certificate 'system certificate-show 89b332d9-d590-4447-bf5a-6edc61c2d0e4' (mode=ssl_ca) | system.certificate.mode=ssl_ca.uuid= | major | 2021-10-08T15:34 |
| | is expiring soon on 2021-10-15, 00:00:00 | 89b332d9-d590-4447-bf5a-6edc61c2d0e4 | | :49.451107 |
| | | | | |
| 400.001 | Service group controller-services degraded; cert-alarm(enabled-active, ) | service_domain=controller. | major | 2021-10-08T15:34 |
| | | service_group=controller-services. | | :27.494473 |
| | | host=controller-0 | | |
| | | | | |
| 100.103 | Memory threshold exceeded ; threshold 80.00%, actual 81.12% | host=controller-0.memory=platform | major | 2021-10-08T00:21 |
| | | | | :25.237489 |
| | | | | |
+----------+------------------------------------------------------------------------------------------+--------------------------------------+----------+------------------+
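The distinction between the two alarm types can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the platform's implementation: the function name `classify_certificate` and the exact boundary handling (an alarm applies as soon as the threshold time is reached) are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

EXPIRING_SOON_ID = "500.200"  # severity: major
EXPIRED_ID = "500.210"        # severity: critical


def classify_certificate(not_after: datetime,
                         now: datetime,
                         alarm_before: timedelta = timedelta(days=30)) -> Optional[str]:
    """Return the alarm ID that applies to a certificate, or None.

    Hypothetical helper: `not_after` is the certificate expiry time and
    `alarm_before` is the Expiring-Soon window (30 days by default).
    """
    if now >= not_after:
        return EXPIRED_ID
    if now >= not_after - alarm_before:
        return EXPIRING_SOON_ID
    return None


# The ssl_ca certificate in the sample output expires on 2021-10-15 and was
# checked on 2021-10-08, which falls inside the default 30-day window.
alarm_id = classify_certificate(datetime(2021, 10, 15),
                                datetime(2021, 10, 8, 15, 34, 49))
```

With the dates taken from the sample output, the helper selects the Expiring Soon alarm, which matches the 500.200 entry shown in the table.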
The platform monitors the following resources to track and audit certificate expiry dates:
All type secrets in all Kubernetes namespaces.
This includes secrets that you create directly or secrets that are indirectly created by configuring a Cert-Manager certificate.
All certificates installed on the platform via the system certificate-install command.
Other internal certificates required by the platform, such as Kubernetes RootCA, Etcd RootCA, etc.
Note
For certificates managed by cert-manager, the Expiring-Soon alarm is not generated unless the certificate's renewBefore date has passed. In this way, alarms for certificates that are auto-renewed by cert-manager only occur if the renewal failed.
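The interaction between the alarm window and cert-manager renewal described in this note can be sketched as follows. This is a simplified model under stated assumptions: `expiring_soon_alarm_due` is a hypothetical name, and the renewBefore date is taken to be the expiry time minus the certificate's renewBefore duration.

```python
from datetime import datetime, timedelta
from typing import Optional


def expiring_soon_alarm_due(not_after: datetime,
                            now: datetime,
                            alarm_before: timedelta = timedelta(days=30),
                            renew_before: Optional[timedelta] = None) -> bool:
    """Decide whether an Expiring-Soon alarm would be expected.

    `renew_before` is the cert-manager renewBefore duration; pass None for
    certificates that are not managed by cert-manager (assumption).
    """
    if now >= not_after:
        return False  # already expired: the Expired alarm applies instead
    if now < not_after - alarm_before:
        return False  # not yet inside the alarm window
    if renew_before is not None and now < not_after - renew_before:
        return False  # cert-manager has not yet been due to renew
    return True
```

For a certificate expiring on 31 January with a 15-day renewBefore, a check on 10 January is inside the default 30-day window but before the renewal date, so no alarm is expected. A check on 20 January would raise one, because renewal should already have happened.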
Overriding Default Certificate Alarming Behavior
For certificates that exist under the Kubernetes domain, Kubernetes annotations can be used to override the default certificate alarming behavior. All other certificate types support only the default alarming behavior, which cannot be overridden.
Note
If you added a certificate by directly creating a Kubernetes Secret, the annotation should be added to that Kubernetes Secret resource. If the Secret was indirectly created by configuring a Cert-Manager certificate resource, the annotation should be added to the certificate resource.
The supported annotations are:
starlingx.io/alarm: <enabled | disabled>
(default=enabled)
starlingx.io/alarm-before: <days>
(default=30d)
starlingx.io/alarm-severity: <critical/major/minor>
starlingx.io/alarm-text: <pre-text in alarm description>
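To make the defaults concrete, the sketch below reads these four annotations from a resource's annotation map and falls back to the documented defaults. The `AlarmSettings` class and `parse_alarm_annotations` function are hypothetical names used for illustration only, and the validation rules are assumptions rather than the platform's actual parsing logic.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "starlingx.io/"


@dataclass
class AlarmSettings:
    enabled: bool = True            # starlingx.io/alarm (default=enabled)
    before_days: int = 30           # starlingx.io/alarm-before (default=30d)
    severity: Optional[str] = None  # None: keep the alarm's default severity
    text: str = ""                  # pre-text added to the alarm description


def parse_alarm_annotations(annotations: Mapping[str, str]) -> AlarmSettings:
    """Build alarm settings from annotations, applying documented defaults."""
    settings = AlarmSettings()

    value = annotations.get(PREFIX + "alarm")
    if value is not None:
        if value not in ("enabled", "disabled"):
            raise ValueError(f"unsupported alarm value: {value!r}")
        settings.enabled = value == "enabled"

    value = annotations.get(PREFIX + "alarm-before")
    if value is not None:
        # Values are written as a day count with a trailing "d", e.g. "15d".
        days = value[:-1] if value.endswith("d") else value
        settings.before_days = int(days)

    value = annotations.get(PREFIX + "alarm-severity")
    if value is not None:
        if value not in ("critical", "major", "minor"):
            raise ValueError(f"unsupported severity: {value!r}")
        settings.severity = value

    settings.text = annotations.get(PREFIX + "alarm-text", "")
    return settings
```

A resource without any of these annotations yields the defaults (enabled, 30 days, default severity, no pre-text).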
For example, if the system-restapi-gui-certificate resource has been configured so that the StarlingX REST API / Webserver certificate is managed by Cert-Manager, its default annotations can be edited as follows:
Open the current configuration:
$ kubectl edit certificate system-restapi-gui-certificate -n deployment
Make the following configuration changes:
metadata:
  annotations:
    starlingx.io/alarm: enabled
    starlingx.io/alarm-before: 15d
    starlingx.io/alarm-severity: minor
    starlingx.io/alarm-text: "webserverAPI certificate"
These override settings cause the system-restapi-gui-certificate resource to be monitored, via the starlingx.io/alarm: enabled annotation. An alarm with minor severity will be raised when the certificate is within 15 days of expiry or has expired. The alarm text will be prefixed with the string webserverAPI certificate, resulting in:
webserverAPI certificate namespace=deployment.certificate=system-restapi-gui-certificate is expiring soon on <date>.
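As an alternative to editing the resource interactively, the same annotations could in principle be applied non-interactively with `kubectl annotate`. The sketch below only builds the command line; it is not taken from the StarlingX procedure, and the helper name `build_annotate_command` is hypothetical. The resource name, namespace, and annotation values are the ones used in the example above.

```python
import shlex
from typing import Dict, List


def build_annotate_command(name: str, namespace: str,
                           annotations: Dict[str, str]) -> List[str]:
    """Build a `kubectl annotate` argument list for a Certificate resource."""
    cmd = ["kubectl", "annotate", "certificate", name,
           "-n", namespace, "--overwrite"]
    # Each annotation is passed as a KEY=VALUE argument.
    cmd += [f"{key}={value}" for key, value in annotations.items()]
    return cmd


command = build_annotate_command(
    "system-restapi-gui-certificate",
    "deployment",
    {
        "starlingx.io/alarm": "enabled",
        "starlingx.io/alarm-before": "15d",
        "starlingx.io/alarm-severity": "minor",
        "starlingx.io/alarm-text": "webserverAPI certificate",
    },
)

# Print a shell-quoted form for review; run it with subprocess.run(command)
# only after confirming it matches the intended change.
print(shlex.join(command))
```

Building the argument list separately from executing it keeps the values free of shell-quoting problems, which matters here because the alarm text contains a space.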
Corrective action
When a certificate alarm occurs, the resource should be updated in order to clear the alarm. If the certificate was installed via the system certificate-install command, a new certificate needs to be obtained and re-installed. Certificates that are managed by Cert-Manager will auto-renew, provided there are no configuration errors. To check for issues with cert-manager auto-renewal of a certificate, run kubectl -n <namespace> describe certificate <certname>.
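When many certificates need to be checked, the same information can be read programmatically. cert-manager Certificate resources report a Ready condition under status.conditions, so one possible approach is to inspect the JSON printed by `kubectl -n <namespace> get certificate <certname> -o json`. The following is a minimal sketch under that assumption; the function name `renewal_issues` is hypothetical.

```python
import json
from typing import List


def renewal_issues(certificate_json: str) -> List[str]:
    """Return messages for a Ready condition that is not "True".

    `certificate_json` is the text printed by
    `kubectl -n <namespace> get certificate <certname> -o json`.
    """
    cert = json.loads(certificate_json)
    issues = []
    for cond in cert.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") != "True":
            reason = cond.get("reason", "Unknown")
            message = cond.get("message", "")
            issues.append(f"{reason}: {message}")
    return issues
```

An empty list means no failing Ready condition was reported. Any returned entries carry the reason and message that `kubectl describe certificate` would also show.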
Note
It may take up to one hour for an active alarm to clear after corrective action has been taken.
500-series-alarm-messages-starlingx