.. This release note was created to address review https://review.opendev.org/c/starlingx/docs/+/862596
.. The Release Notes will be updated and a separate gerrit review will be sent out
.. Ignore the contents in this RN except for the updates stated in the comment above
.. _release-notes:
.. The Stx 10.0 RN is WIP and not ready for review.
.. Removed appearances of Armada as its not supported
===================
R10.0 Release Notes
===================
.. rubric:: |context|

StarlingX is a fully integrated edge cloud software stack that provides
everything needed to deploy an edge cloud on one, two, or up to 100 servers.

This section describes the new capabilities, known limitations and workarounds,
fixed defects, and deprecation notices in the StarlingX 9.0 release.
.. contents::
   :local:
   :depth: 1

---------
ISO image
---------
The pre-built ISO (Debian) for StarlingX Release 9.0 is located at the
``StarlingX mirror`` repo:
https://mirror.starlingx.windriver.com/mirror/starlingx/release/9.0.0/debian/monolithic/outputs/iso/
-------------------------------------
Source Code for StarlingX Release 9.0
-------------------------------------
The source code for StarlingX Release 9.0 is available on the r/stx.9.0
branch in the `StarlingX repositories <https://opendev.org/starlingx>`_.
----------
Deployment
----------
To deploy StarlingX Release 9.0, see `Consuming StarlingX <https://docs.starlingx.io/introduction/consuming.html>`_.
For detailed installation instructions, see `StarlingX 9.0 Installation Guides <https://docs.starlingx.io/deploy_install_guides/index-install-e083ca818006.html>`_.
-----------------------------
New Features and Enhancements
-----------------------------
.. start-new-features-r9
The sections below provide a detailed list of new features and links to the
associated user guides (if applicable).
*********************
Kubernetes up-version
*********************
In StarlingX 9.0, the supported Kubernetes versions range from v1.24 to v1.27.
****************************************
Platform Application Components Revision
****************************************
.. Need updated versions for this section wherever applicable
The following applications have been updated to a new version in StarlingX Release 9.0.
All platform application up-versions are updated to remain current and address
security vulnerabilities in older versions.
- app-sriov-fec-operator: 2.7.1
- cert-manager: 1.11.1
- metric-server: 1.0.18
- nginx-ingress-controller: 1.9.3
- oidc-dex: 2.37.0
- vault: 1.14.8
- portieris: 0.13.10
- istio: 1.19.4
- kiali: 1.75.0
******************
FluxCD Maintenance
******************
FluxCD helm-controller is upgraded from v0.27.0 to v0.35.0 and is compatible
with Helm versions up to v3.12.1 and Kubernetes v1.27.3.
FluxCD source-controller is upgraded from v0.32.1 to v1.0.1 and is compatible
with Helm versions up to v3.12.1 and Kubernetes v1.27.3.
****************
Helm Maintenance
****************
Helm has been upgraded to v3.12.2 in StarlingX Release 9.0.
*******************************************
Support for Silicom TimeSync Server Adaptor
*******************************************
The Silicom network adaptor provides local time sync support via a local |GNSS|
module, which is based on the Intel Columbiaville device.

- ``cvl-4.10`` Silicom driver bundle
- ice driver: 1.10.1.2
- i40e driver: 2.21.12
- iavf driver: 4.6.1

.. note::

   ``cvl-4.10`` is only recommended if the Silicom STS2 card is used.
*********************************************
Kubernetes Upgrade Optimization - AIO-Simplex
*********************************************
**Configure Kubernetes Multi-Version Upgrade Cloud Orchestration for AIO-SX**
You can configure Kubernetes multi-version upgrade orchestration strategy using
the :command:`sw-manager` command. This feature is enabled from
|prod| |k8s-multi-ver-orch-strategy-release| and is supported only for the
|AIO-SX| system.
**See**: :ref:`Configure Kubernetes Multi-Version Upgrade Cloud Orchestration for AIO-SX <configuring-kubernetes-multi-version-upgrade-orchestration-aio-b0b59a346466>`
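
The typical orchestration workflow is sketched below. The strategy subcommands
follow the linked guide, but the target version and other option values shown
here are illustrative assumptions only.

.. code-block:: none

   # Sketch only: create, apply and monitor a Kubernetes upgrade strategy.
   # The --to-version value is an example; use a version supported by your system.
   ~(keystone_admin)]$ sw-manager kube-upgrade-strategy create --to-version v1.27.5
   ~(keystone_admin)]$ sw-manager kube-upgrade-strategy apply
   ~(keystone_admin)]$ sw-manager kube-upgrade-strategy show
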
**Manual Kubernetes Multi-Version Upgrade in AIO-SX**
|AIO-SX| now supports multi-version Kubernetes upgrades. In this model,
Kubernetes is upgraded by two or more versions after disabling applications and
then applications are enabled again. This is faster than upgrading Kubernetes
one version at a time. Also, the upgrade can be aborted and reverted to the
original version. This feature is supported only for |AIO-SX|.
**See**: :ref:`Manual Kubernetes Multi-Version Upgrade in AIO-SX <manual-kubernetes-multi-version-upgrade-in-aio-sx-13e05ba19840>`
***********************************
Platform Admin Network Introduction
***********************************
The newly introduced admin network is an optional network that is used to
monitor and control internal |prod| traffic between the subclouds and system
controllers in a Distributed Cloud environment. This function is performed by
the management network in the absence of an admin network. However, the admin
network is more easily reconfigured to handle subnet and IP address network
parameter changes after initial configuration.

In deployment configurations, static routes from the management or admin
interface of the subcloud controller nodes to the system controller's
management subnet must be present. This ensures that the subcloud comes online
after deployment.

.. note::

   The admin network is optional. The default management network will be used
   if it is not present.
You can manage an optional admin network on a subcloud for IP connectivity to
the system controller management network where the IP addresses of the admin
network can be changed.
**See**:
- :ref:`Common Components <common-components>`
- :ref:`Manage Subcloud Network Parameters <update-a-subcloud-network-parameters-b76377641da4>`
****************************************************
L3 Firewalls for all |prod-long| Platform Interfaces
****************************************************
|prod| incorporates default firewall rules for the platform networks (|OAM|,
management, cluster-host, pxeboot, admin, and storage). You can configure
additional Kubernetes Network Policies to augment or override the default rules.
**See**:
- :ref:`Modify Firewall Options <security-firewall-options>`
- :ref:`Default Firewall Rules <security-default-firewall-rules>`
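
For illustration only, a minimal Kubernetes NetworkPolicy of the following
shape could be used to augment the default rules; the namespace, labels, and
port are placeholders, and the platform-specific details are covered in the
guides linked above.

.. code-block:: none

   # Illustrative sketch: allow ingress to TCP port 8080 for pods labelled app=my-app.
   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: allow-my-app-ingress
     namespace: default
   spec:
     podSelector:
       matchLabels:
         app: my-app
     ingress:
       - ports:
           - protocol: TCP
             port: 8080
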
****************************************************
app-sriov-fec-operator upgrade to FEC operator 2.7.1
****************************************************
A new version of the |FEC| Operator, v2.7.1 (for all Intel hardware
accelerators), is supported. It adds ``igb_uio`` support, makes the accelerator
resource names configurable, and enables accelerator device configuration
using the ``igb_uio`` driver when secure boot is enabled in the BIOS.

.. note::

   The |FEC| operator is now running on the |prod| platform core.
**See**: :ref:`Configure Intel Wireless FEC Accelerators using SR-IOV FEC operator <configure-sriov-fec-operator-to-enable-hw-accelerators-for-hosted-vran-containarized-workloads>`
**************************************
Redundant System Clock Synchronization
**************************************
The ``phc2sys`` application can be configured to accept multiple source clock
inputs. The quality of these sources is compared against user-defined priority
values and the best available source is selected to set the system time.
The ``phc2sys`` application continuously monitors the quality of the configured
sources and selects a new best source if the current source degrades or if
another source becomes higher quality.
**See**: :ref:`Redundant System Clock Synchronization <redundant-system-clock-synchronization-89ee23f54fbb>`.
*******************************************************
Configure Intel E810 NICs using Intel Ethernet Operator
*******************************************************
You can install and use **Intel Ethernet** operator to orchestrate and manage
the configuration and capabilities provided by Intel E810 Series network
interface cards (NICs).
**See**: :ref:`Configure Intel E810 NICs using Intel Ethernet Operator <configure-intel-e810-nics-using-intel-ethernet-operator>`.
****************
AppArmor Support
****************
AppArmor is a Mandatory Access Control (MAC) system built on Linux's LSM (Linux
Security Modules) interface. In practice, the kernel queries AppArmor before
each system call to know whether the process is authorized to do the given
operation. Through this mechanism, AppArmor confines programs to a limited set
of resources.

AppArmor helps administrators run a more secure Kubernetes deployment by
restricting the operations that containers/pods are allowed to perform and by
providing better auditing through system logs. The access needed by a
container/pod is configured through profiles tuned to allow access such as
Linux capabilities, network access, file permissions, etc.
**See**: :ref:`About AppArmor <about-apparmor-ebdab8f1ed87>`.
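
As a minimal sketch, assuming a profile named ``my-profile`` has already been
loaded on the host, an AppArmor profile is applied to a container through a
pod annotation (the pod, container, and profile names are placeholders):

.. code-block:: none

   # Illustrative only: confine the "demo" container with the pre-loaded "my-profile" profile.
   apiVersion: v1
   kind: Pod
   metadata:
     name: apparmor-demo
     annotations:
       container.apparmor.security.beta.kubernetes.io/demo: localhost/my-profile
   spec:
     containers:
       - name: demo
         image: busybox
         command: ["sleep", "3600"]
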
*****************
Support for Vault
*****************
This release re-introduces support for Vault, which was intermittently
unavailable in |prod|. The supported versions are vault 1.14.8 or later,
vault-k8s 1.2.1, and helm-chart 0.25.0, following the Helm v3 up-version to
3.6+.

|prod| optionally integrates the open source Vault containerized security
application into the |prod| solution; it requires |PVCs| as a storage backend
to be enabled.
**See**: :ref:`Vault Overview <security-vault-overview>`.
*********************
Support for Portieris
*********************
|prod| now supports Portieris version 0.13.10. Portieris is an open source
Kubernetes admission controller which ensures that only policy-compliant
images, such as signed images from trusted registries, can run. The Portieris
application uses images from the ``icr.io`` registry. You must configure
service parameters for the ``icr.io`` registry prior to applying the Portieris
application,
see: :ref:`About Changing External Registries for StarlingX Installation <about-changing-external-registries-for-starlingx-installation>`.
For Distributed Cloud deployments, the images must be present on the System
Controller registry.
**See**: :ref:`Portieris Overview <portieris-overview>`.
**************************
Configurable Power Manager
**************************
Configurable Power Manager focuses on containerized applications that use power
profiles individually, by core and/or by application.

|prod| has the capability to regulate the frequency of the entire processor.
However, this control is primarily directed towards the classification of the
core, distinguishing between application and platform cores. Consequently, if a
user requires control over an individual core, such as core 10 in a 24-core
CPU, adjustments must be applied to all cores collectively. In the context of
containerized operations, it becomes necessary to establish personalized
configurations. This entails assigning each container the requisite power
configuration; in essence, providing specific and individualized power
configurations to each core or group of cores.
**See**: :ref:`Configurable Power Manager <configurable-power-manager-04c24b536696>`.
******************************************************
Technology Preview - Install Power Metrics Application
******************************************************
The Power Metrics app deploys two containers, cAdvisor and Telegraf, that
collect metrics about hardware usage.
**See**: :ref:`Install Power Metrics Application <install-power-metrics-application-a12de3db7478>`.
*******************************************************
Install Node Feature Discovery (NFD) |prod| Application
*******************************************************
Node Feature Discovery (NFD) version 0.15.0 detects hardware features available
on each node in a kubernetes cluster and advertises those features using
Kubernetes node labels. This procedure walks you through the process of
installing the |NFD| |prod| Application.
**See**: :ref:`Install Node Feature Discovery Application <install-node-feature-discovery-nfd-starlingx-application-70f6f940bb4a>`.
****************************************************************************
Partial Disk (Transparent) Encryption Support via Software Encryption (LUKS)
****************************************************************************
A new encrypted filesystem using Linux Unified Key Setup (LUKS) is created
automatically on all hosts to store security-sensitive files. It is mounted
at ``/var/luks/stx/luks_fs``, and the files kept in the
``/var/luks/stx/luks_fs/controller`` directory are replicated between
controllers.
*************************************************************
K8s API/CLI OIDC (Dex) Authentication with Local LDAP Backend
*************************************************************
|prod| offers |LDAP| commands to create and manage |LDAP| Linux groups as part
of a StarlingX local |LDAP| server (serving the local StarlingX cluster and,
in the case of Distributed Cloud, the entire Distributed Cloud system).
StarlingX provides procedures to configure the **oidc-auth-apps** |OIDC|
Identity Provider (Dex) system application to use the StarlingX local |LDAP|
server (in addition to, or in place of the already supported remote Windows
Active Directory) to authenticate users of the Kubernetes API.
**See**:
- :ref:`Overview of LDAP Servers <overview-of-ldap-servers>`
- :ref:`Create LDAP Linux Groups <create-ldap-linux-groups-4c94045f8ee0>`
- :ref:`Configure Kubernetes Client Access <configure-kubernetes-client-access>`
************************
Create LDAP Linux Groups
************************
|prod| offers |LDAP| commands to create and manage |LDAP| Linux groups as part
of the `ldapscripts` library.
*****************************************
StarlingX OpenStack now supports Antelope
*****************************************
The stx-openstack application has been updated and now deploys OpenStack
services based on the Antelope release.
*******************
Pod Security Policy
*******************
|PSP| ONLY applies if running on Kubernetes v1.24 or earlier. |PSP| is
deprecated as of Kubernetes v1.21 and is removed in Kubernetes v1.25.
Instead of using |PSP|, you can enforce similar restrictions on Pods using
:ref:`Pod Security Admission Controller <pod-security-admission-controller-8e9e6994100f>`.

Since its introduction, |PSP| has had usability problems. The way |PSPs|
are applied to pods has proven confusing, especially when trying to use them.
It is easy to accidentally grant broader permissions than intended, and
difficult to inspect which |PSPs| apply in a certain situation. Kubernetes
offers a built-in |PSA| controller that will replace |PSPs| in the future.
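
For illustration, |PSA| restrictions are applied by labelling a namespace with
the desired Pod Security Standard level; the namespace name and level below
are placeholders:

.. code-block:: none

   # Illustrative sketch: enforce (and warn on) the "restricted" profile for a namespace.
   kubectl label namespace my-namespace \
       pod-security.kubernetes.io/enforce=restricted \
       pod-security.kubernetes.io/warn=restricted
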
*************************************************
|WAD| users sudo and local linux group assignment
*************************************************
StarlingX 9.0 supports and provides procedures for centrally configured
Windows Active Directory (WAD) users with sudo access and local Linux group
assignments; i.e. with only |WAD| configuration changes.
**See**:
- :ref:`Create LDAP Linux Accounts <create-ldap-linux-accounts>`
- :ref:`Local LDAP Certificates <local-ldap-certificates-4e1df1e39341>`
- :ref:`SSH User Authentication using Windows Active Directory <sssd-support-5fb6c4b0320b>`
*******************************************
Subcloud Error Root Cause Correction Action
*******************************************
This feature provides a root cause analysis of subcloud deployment and upgrade
failures. This includes:

- the existing ``deploy_status`` field, which provides progress through the
  phases of subcloud deployment and, on error, the phase that failed
- a new ``deploy_error_desc`` attribute that provides a summary of the key
  deployment/upgrade errors
- additional text appended to the end of the ``deploy_error_desc`` error
  message, with information on:

  - troubleshooting commands
  - the root cause of the errors
  - suggested recovery actions
**See**: :ref:`Manage Subclouds Using the CLI <managing-subclouds-using-the-cli>`
************************************
Patch Orchestration Phase Operations
************************************
The distributed cloud patch orchestration has the option to separate the upload
from the apply, remove, install and reboot operations. This facilitates
performing the upload operations outside of the system maintenance window
to reduce the total execution time during the patch activation that occurs
during the maintenance window. With the separation of operations, systems can
be prestaged with the updates prior to applying the changes to the system.
**See**: :ref:`Distributed Cloud Guide <index-dist-cloud-kub-95bef233eef0>`
****************************************************
Long Latency Between System Controller and Subclouds
****************************************************
The procedure for rehoming a subcloud that has been powered off for a long
period of time differs from the regular rehoming procedure. Depending on how
long the subcloud has been offline, the platform certificates may have expired
and will need to be regenerated.
**See**: :ref:`Rehoming Subcloud with Expired Certificates <rehoming-subcloud-with-expired-certificates-00549c4ea6e2>`
**************
GEO Redundancy
**************
|prod| may be deployed across a geographically distributed set of regions. A
region consists of a local Kubernetes cluster with local redundancy and access
to high-bandwidth, low-latency networking between hosts within that region.
|prod-long| Distributed Cloud GEO redundancy configuration supports the ability
to recover from a catastrophic event that requires subclouds to be rehomed away
from the failed system controller site to the available site(s) which have
enough spare capacity. This way, even if the failed site cannot be restored in
short time, the subclouds can still be rehomed to available peer system
controller(s) for centralized management.
In this release, the following items are addressed:

* 1+1 GEO redundancy

  - Active-Active redundancy model
  - Total number of subclouds should not exceed 1K

* Automated operations

  - Synchronization and liveness check between peer systems
  - Alarm generation if peer system controller is down

* Manual operations

  - Batch rehoming from alive peer system controller

**See**: :ref:`GEO Redundancy <overview-of-distributed-cloud-geo-redundancy>`
********************************
Redfish Virtual Media Robustness
********************************
Redfish virtual media operations have been observed to frequently fail with
transient errors. While the conditions for those failures are not always known
(network issues, BMC timeouts, etc.), it has been observed that if the subcloud
install operation is retried, the operation is successful.
To alleviate these transient conditions, the robustness of the Redfish Virtual
Media Controller (RVMC) is improved by introducing additional error
handling and retry attempts.
**See**: :ref:`Install a Subcloud Using Redfish Platform Management Service <installing-a-subcloud-using-redfish-platform-management-service>`
.. end-new-features-r9
----------------
Hardware Updates
----------------
**See**:
- :ref:`Kubernetes Verified Commercial Hardware <verified-commercial-hardware>`
----------
Bug status
----------
**********
Fixed bugs
**********
This release provides fixes for a number of defects. Refer to the StarlingX bug
database to review the R9.0 `Fixed Bugs <https://bugs.launchpad.net/starlingx/+bugs?field.searchtext=&orderby=-importance&field.status%3Alist=FIXRELEASED&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=stx.9.0&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&search=Search>`_.
.. All please confirm if any Limitations need to be removed / added for Stx 9.0.
---------------------------------
Known Limitations and Workarounds
---------------------------------
The following are known limitations you may encounter with your |prod| Release
9.0 and earlier releases. Workarounds are suggested where applicable.
.. note::

   These limitations are considered temporary and will likely be resolved in
   a future release.
************************************************
Suspend/Resume on VMs with SR-IOV (direct) Ports
************************************************
When using VMs with SR-IOV ports created with the ``-vnic-type=direct`` option,
the instance, after a Suspend and subsequent Resume action, might come up with
all virtual NICs created but missing the IP address of the vNIC connected to
the SR-IOV port.

**Workaround**: Manually Power-Off and Power-On (or Hard-Reboot) the instance,
and the IP address will be assigned correctly again (no information is lost).
*****************************************
Error on Restoring OpenStack after Backup
*****************************************
The ansible command for restoring the app will fail with |prod-long| Release 9.0
with an error message mentioning the absence of an Armada directory.

**Workaround**: Manually change the backup tarball, adding the Armada directory
using the following steps:

.. code-block:: none

   tar -xzf wr_openstack_backup_file.tgz            # this will create an opt directory
   cp -r opt/platform/fluxcd/ opt/platform/armada   # copy fluxcd to armada
   tar -czf new_wr-openstack_backup.tgz opt/        # tar the opt directory into a new backup tarball
*****************************************
Subcloud Upgrade with Kubernetes Versions
*****************************************
Subcloud Kubernetes versions are upgraded along with the System Controller.
You can add a new subcloud while the System Controller is on an intermediate
version of Kubernetes, as long as the needed Kubernetes images are available
at the configured sources.

**Workaround**: In a Distributed Cloud configuration, when upgrading from
|prod-long| Release 7.0, the Kubernetes version is v1.23.1. The default
Kubernetes version for a new install is v1.24.4. Kubernetes must be upgraded
one version at a time on the System Controller.

.. note::

   New subclouds should not be added until the System Controller has been
   upgraded to Kubernetes v1.24.4.
****************************************************
AIO-SX Restore Fails during puppet-manifest-apply.sh
****************************************************
Restore fails using a backup file created after a fresh install.
**Workaround**: During the restore process, after reinstalling the controllers,
the |OAM| interface must be configured with the same IP address protocol version
used during installation.
**************************************************************************
Subcloud Controller-0 is in a degraded state after upgrade and host unlock
**************************************************************************
During an upgrade orchestration of the subcloud from |prod-long| Release 7.0
to |prod-long| Release 8.0, and after host unlock, the subcloud is in a
``degraded`` state, and alarm 200.004 is raised, displaying
"controller-0 experienced a service-affecting failure. Auto-recovery in progress".
**Workaround**: You can recover the subcloud to the ``available`` state by
locking and unlocking controller-0.
***********************************************************************
Limitations when using Multiple Driver Versions for the same NIC Family
***********************************************************************
The capability to support multiple NIC driver versions has the following
limitations:

- The Intel NIC family supports only the ice, i40e, and iavf drivers.
- Driver versions must respect the compatibility matrix between drivers.
- Multiple driver versions cannot be loaded simultaneously; the selected
  versions apply to the entire system.
- The latest driver version is loaded by default, unless the system is
  specifically configured to use a legacy driver version.
- Drivers used by the installer always use the latest version, therefore
  firmware compatibility must support basic NIC operations for each version
  to facilitate installation.
- A host reboot is required to activate the configured driver versions.
- For Backup and Restore, the host must be rebooted a second time in order to
  activate the configured driver versions.

**Workaround**: NA
*****************
Quartzville Tools
*****************
The :command:`celo64e` and :command:`nvmupdate64e` commands are not
supported in |prod-long| Release 8.0 due to a known issue in Quartzville
tools that crashes the host.
**Workaround**: Reboot the host using the boot screen menu.
*************************************************
Controller SWACT unavailable after System Restore
*************************************************
After performing a restore of the system, the user is unable to swact the
controller.
**Workaround**: NA
*************************************************************
Intermittent Kubernetes Upgrade failure due to missing Images
*************************************************************
During a Kubernetes upgrade, the upgrade may intermittently fail when you run
:command:`system kube-host-upgrade <host> control-plane` due to the
containerd cache being cleared.
**Workaround**: If the above failure is encountered, run the following commands
on the host encountering the failure:
.. rubric:: |proc|

#. Ensure the failure is due to missing images by running ``crictl images`` and
   confirming the following are not present:

   .. code-block:: none

      registry.local:9001/k8s.gcr.io/kube-apiserver:v1.24.4
      registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4
      registry.local:9001/k8s.gcr.io/kube-scheduler:v1.24.4
      registry.local:9001/k8s.gcr.io/kube-proxy:v1.24.4

#. Manually pull the images into the containerd cache by running the following
   commands, replacing ``<admin_password>`` with your password for the admin
   user.

   .. code-block:: none

      ~(keystone_admin)]$ crictl pull --creds admin:<admin_password> registry.local:9001/k8s.gcr.io/kube-apiserver:v1.24.4
      ~(keystone_admin)]$ crictl pull --creds admin:<admin_password> registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4
      ~(keystone_admin)]$ crictl pull --creds admin:<admin_password> registry.local:9001/k8s.gcr.io/kube-scheduler:v1.24.4
      ~(keystone_admin)]$ crictl pull --creds admin:<admin_password> registry.local:9001/k8s.gcr.io/kube-proxy:v1.24.4

#. Ensure the images are present when running ``crictl images``. Rerun the
   :command:`system kube-host-upgrade <host> control-plane` command.
***********************************
Docker Network Bridge Not Supported
***********************************
The Docker Network Bridge, previously created by default, is removed and no
longer supported in |prod-long| Release 8.0 as the default bridge IP address
collides with addresses already in use.
As a result, docker can no longer be used for running containers. This impacts
building docker images directly on the host.
**Workaround**: Create a Kubernetes pod that has network access, log in
to the container, and build the docker images.
*************************************
Impact of Kubernetes Upgrade to v1.24
*************************************
In Kubernetes v1.24 support for the ``RemoveSelfLink`` feature gate was removed.
In previous releases of |prod-long| this has been set to "false" for backward
compatibility, but this is no longer an option and it is now hardcoded to "true".
**Workaround**: Any application that relies on this feature gate being disabled
(i.e. assumes the existence of the "self link") must be updated before
upgrading to Kubernetes v1.24.
*******************************************************************
Password Expiry Warning Message is not shown for LDAP user on login
*******************************************************************
In |prod-long| Release 8.0, the password expiry warning message is not shown
for LDAP users on login when the password is nearing expiry. This is due to
the ``pam-sssd`` integration.
**Workaround**: It is highly recommended that LDAP users maintain independent
notifications and update their passwords every 3 months.

The expired password can be reset by a user with root privileges using
the following command:

.. code-block:: none

   ~(keystone_admin)]$ sudo ldapsetpasswd ldap-username
   Password:
   Changing password for user uid=ldap-username,ou=People,dc=cgcs,dc=local
   New Password:
   Retype New Password:
   Successfully set password for user uid=ldap-username,ou=People,dc=cgcs,dc=local
******************************************
Console Session Issues during Installation
******************************************
After bootstrap and before unlocking the controller, if the console session
times out (or the user logs out), ``systemd`` does not work properly.
``fm``, ``sysinv``, and ``mtcAgent`` do not initialize.
**Workaround**: If the console times out or the user logs out between bootstrap
and unlock of controller-0, then, to recover from this issue, you must
re-install the ISO.
************************************************
PTP O-RAN Spec Compliant Timing API Notification
************************************************
.. Need the version for the .tgz tarball....Please confirm if this is applicable to stx 8.0?
- The ptp-notification <minor_version>.tgz application tarball and the
  corresponding notificationservice-base:stx8.0-v2.0.2 image are not backwards
  compatible with applications using the ``v1 ptp-notification`` API and the
  corresponding notificationclient-base:stx.8.0-v2.0.2 image.
  Backward compatibility will be provided in StarlingX Release 9.0.

  .. note::

     For |O-RAN| Notification support (v2 API), deploy and use the
     ``ptp-notification-<minor_version>.tgz`` application tarball. Instructions
     for this can be found in the |prod-long| Release 8.0 documentation.

  **See**:

  - :ref:`install-ptp-notifications`
  - :ref:`integrate-the-application-with-notification-client-sidecar`

- The ``v1 API`` only supports monitoring a single ptp4l + phc2sys instance.

  **Workaround**: Ensure the system is not configured with multiple instances
  when using the v1 API.

- The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to
  allow a client to subscribe to all notifications from a node. This endpoint
  is not supported in |prod-long| Release 8.0.

  **Workaround**: A specific subscription for each resource type must be
  created instead.

- ``v1 / v2``

  - v1: Support for monitoring a single ptp4l instance per host - no other
    services can be queried/subscribed to.
  - v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01
    with the following exceptions, which are not supported in |prod-long|
    Release 8.0:

    - O-RAN SyncE Lock-Status-Extended notifications
    - O-RAN SyncE Clock Quality Change notifications
    - O-RAN Custom cluster names
    - /././sync endpoint

  **Workaround**: See the respective PTP-notification v1 and v2 document
  subsections for further details.

  - v1: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v1.html
  - v2: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v2.html
**************************************************************************
Upper case characters in host names cause issues with kubernetes labelling
**************************************************************************
Upper case characters in host names cause issues with kubernetes labelling.
**Workaround**: Host names should be lower case.
****************
Debian Bootstrap
****************
On CentOS, bootstrap worked even if **dns_servers** was not present in
localhost.yml. This does not work for Debian bootstrap.

**Workaround**: You need to configure the **dns_servers** parameter in
localhost.yml, as long as no |FQDNs| were used in the bootstrap overrides in
the localhost.yml file for Debian bootstrap.
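
A minimal sketch of the relevant localhost.yml entry is shown below; the
addresses are placeholders only.

.. code-block:: none

   # Excerpt from localhost.yml bootstrap overrides (example addresses).
   dns_servers:
     - 8.8.8.8
     - 8.8.4.4
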
***********************
Installing a Debian ISO
***********************
The disks and disk partitions need to be wiped before the install.
Installing a Debian ISO may fail with a message that the system is
in emergency mode if the disks and disk partitions are not
completely wiped before the install, especially if the server was
previously running a CentOS ISO.
**Workaround**: When installing a lab for any Debian install, the disks must
first be completely wiped using the following procedure before starting
an install.
Use the following commands before any Debian install, for each disk
(e.g. sda, sdb, etc.):

.. code-block:: none

   sudo wipedisk

   # Show the partition table
   sudo sgdisk -p /dev/sda

   # Clear the partition table
   sudo sgdisk -o /dev/sda

.. note::

   The above commands must be run before any Debian install. The above
   commands must also be run if the same lab is used for CentOS installs after
   the lab was previously running a Debian ISO.
**********************************
Security Audit Logging for K8s API
**********************************
A custom policy file can only be created at bootstrap in ``apiserver_extra_volumes``.
If a custom policy file was configured at bootstrap, then after bootstrap the
user has the option to configure the parameter ``audit-policy-file`` to either
this custom policy file (``/etc/kubernetes/my-audit-policy-file.yml``) or the
default policy file ``/etc/kubernetes/default-audit-policy.yaml``. If no
custom policy file was configured at bootstrap, then the user can only
configure the parameter ``audit-policy-file`` to the default policy file.
Only the parameter ``audit-policy-file`` is configurable after bootstrap, so
the other parameters (``audit-log-path``, ``audit-log-maxsize``,
``audit-log-maxage`` and ``audit-log-maxbackup``) cannot be changed at
runtime.
**Workaround**: NA
**See**: :ref:`kubernetes-operator-command-logging-663fce5d74e7`.
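
As an illustrative sketch only, a custom policy file referenced by
``audit-policy-file`` follows the standard Kubernetes audit policy format,
for example:

.. code-block:: none

   # Example content for /etc/kubernetes/my-audit-policy-file.yml (illustrative only).
   apiVersion: audit.k8s.io/v1
   kind: Policy
   rules:
     - level: Metadata
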
******************************************************************
Installing subcloud with patches in Partial-Apply is not supported
******************************************************************
When a patch has been uploaded and applied, but not installed, it is in
a ``Partial-Apply`` state. If a remote subcloud is installed via Redfish
(miniboot) at this point, it will run the patched software. Any patches in this
state will be applied on the subcloud as it is installed. However, this is not
reflected in the output from the :command:`sw-patch query` command on the
subcloud.
**Workaround**: For remote subcloud install operations using the Redfish
protocol, you should avoid installing any subclouds if there are System
Controller patches in the ``Partial-Apply`` state.
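
For example, the patch state on the System Controller can be confirmed before
starting a remote subcloud install:

.. code-block:: none

   # Verify that no patches are listed in the Partial-Apply state.
   ~(keystone_admin)]$ sudo sw-patch query
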
******************************************
PTP is not supported on Broadcom 57504 NIC
******************************************
|PTP| is not supported on the Broadcom 57504 NIC.
**Workaround**: None. Do not configure |PTP| instances on the Broadcom 57504
NIC.
*************************************
Metrics Server Update across Upgrades
*************************************
After a platform upgrade, the Metrics Server will NOT be automatically updated.
**Workaround**: To update the Metrics Server,
**See**: :ref:`Install Metrics Server <kubernetes-admin-tutorials-metrics-server>`
***********************************************************************************
Horizon Drop-Down lists in Chrome and Firefox causes issues due to the new branding
***********************************************************************************
Drop-down menus in Horizon do not work correctly on Chrome and Firefox due to
the 'select' HTML element. It is considered a 'replaced element', as it is
generated by the browser and/or operating system, and this element has a
limited range of customizable CSS properties.

**Workaround**: The system should be 100% usable even with this limitation.
Changing the browser's and/or operating system's theme could solve display
issues in case they limit the legibility of the elements (i.e. white text and
white background).
************************************************************************************************
Deploying an App using nginx controller fails with internal error after controller.name override
************************************************************************************************
A Helm override of controller.name on the nginx-ingress-controller app may
result in errors when creating ingress resources later on.

Example of a Helm override:

.. code-block:: none

   cat <<EOF> values.yml
   controller:
     name: notcontroller
   EOF

   ~(keystone_admin)$ system helm-override-update nginx-ingress-controller ingress-nginx kube-system --values values.yml
   +----------------+-----------------------+
   | Property       | Value                 |
   +----------------+-----------------------+
   | name           | ingress-nginx         |
   | namespace      | kube-system           |
   | user_overrides | controller:           |
   |                |   name: notcontroller |
   |                |                       |
   +----------------+-----------------------+

   ~(keystone_admin)$ system application-apply nginx-ingress-controller
**Workaround**: NA
************************************************
Kata Container is not supported on StarlingX 8.0
************************************************
Kata Containers, which were supported on CentOS in earlier releases of
|prod-long|, are not supported on |prod-long| Release 8.0.
***********************************************
Vault is not supported on StarlingX Release 8.0
***********************************************
The Vault application is not supported on |prod-long| Release 8.0.
**Workaround**: NA
***************************************************
Portieris is not supported on StarlingX Release 8.0
***************************************************
The Portieris application is not supported on |prod-long| Release 8.0.
**Workaround**: NA
*****************************
DCManager Patch Orchestration
*****************************
.. warning::

   Patches must be applied or removed on the System Controller prior to using
   the :command:`dcmanager patch-strategy` command to propagate changes to the
   subclouds.
****************************************
Optimization with a Large number of OSDs
****************************************
As Storage nodes are not optimized, you may need to optimize your Ceph
configuration for balanced operation across deployments with a high number of
|OSDs|. This results in an alarm being generated even if the installation
succeeds:

``800.001 - Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s'``

**Workaround**: To optimize your storage nodes with a large number of |OSDs|,
it is recommended to use the following commands:

.. code-block:: none

   $ ceph osd pool set kube-rbd pg_num 256
   $ ceph osd pool set kube-rbd pgp_num 256
******************************************************************
PTP tx_timestamp_timeout causes ptp4l port to transition to FAULTY
******************************************************************
NICs using the Intel ice NIC driver may report the following in the ``ptp4l``
logs, which might coincide with a |PTP| port switching to ``FAULTY`` before
re-initializing.

.. code-block:: none

   ptp4l[80330.489]: timed out while polling for tx timestamp
   ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug

This is due to a limitation of the Intel ice driver.

**Workaround**: The recommended workaround is to set the ``tx_timestamp_timeout``
parameter to 700 (ms) in the ``ptp4l`` config using the following command:

.. code-block:: none

   ~(keystone_admin)]$ system ptp-instance-parameter-add ptp-inst1 tx_timestamp_timeout=700
***************
BPF is disabled
***************
|BPF| cannot be used in the PREEMPT_RT/low latency kernel, due to the inherent
incompatibility between PREEMPT_RT and |BPF|, see, https://lwn.net/Articles/802884/.
Some packages might be affected when PREEMPT_RT and |BPF| are used together.
This includes, but is not limited to, the following packages:
- libpcap
- libnet
- dnsmasq
- qemu
- nmap-ncat
- libv4l
- elfutils
- iptables
- tcpdump
- iproute
- gdb
- valgrind
- kubernetes
- cni
- strace
- mariadb
- libvirt
- dpdk
- libteam
- libseccomp
- binutils
- libbpf
- dhcp
- lldpd
- containernetworking-plugins
- golang
- i40e
- ice
**Workaround**: It is recommended not to use |BPF| with the real-time kernel.
If required, it can still be used, for example, for debugging only.
*****************
crashkernel Value
*****************
**crashkernel=auto** is no longer supported by newer kernels, and hence the
v5.10 kernel does not support the "auto" value.

**Workaround**: |prod-long| uses **crashkernel=2048m** instead of
**crashkernel=auto**.

.. note::

   |prod-long| Release 8.0 has increased the amount of reserved memory for
   the crash/kdump kernel from 512 MiB to 2048 MiB.
***********************
Control Group parameter
***********************
The control group (cgroup) parameter **kmem.limit_in_bytes** has been
deprecated, and results in the following message in the kernel's log buffer
(dmesg) during boot-up and/or during the Ansible bootstrap procedure:
"kmem.limit_in_bytes is deprecated and will be removed. Please report your
use case to linux-mm@kvack.org if you depend on this functionality." This
parameter is used by a number of software packages in |prod-long|, including,
but not limited to, **systemd**, **docker**, **containerd**, **libvirt**, etc.
**Workaround**: NA. This is only a warning message about the future deprecation
of an interface.
****************************************************
Kubernetes Taint on Controllers for Standard Systems
****************************************************
In Standard systems, a Kubernetes taint is applied to controller nodes in order
to prevent application pods from being scheduled on those nodes; since
controllers in Standard systems are intended ONLY for platform services.
If application pods MUST run on controllers, a Kubernetes toleration of the
taint can be specified in the application's pod specifications.
**Workaround**: Customer applications that need to run on controllers on
Standard systems will need to be enabled/configured for Kubernetes toleration
in order to ensure the applications continue working after an upgrade from
|prod-long| Release 6.0 to |prod-long| future Releases. It is suggested to add
the Kubernetes toleration to your application prior to upgrading to |prod-long|
Release 8.0.
You can specify toleration for a pod through the pod specification (PodSpec).
For example:
.. code-block:: none

   spec:
     ....
     template:
       ....
       spec:
         tolerations:
           - key: "node-role.kubernetes.io/control-plane"
             operator: "Exists"
             effect: "NoSchedule"
**See**: `Taints and Tolerations <https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/>`__.
********************************************************
New Kubernetes Taint on Controllers for Standard Systems
********************************************************
A new Kubernetes taint will be applied to controllers for Standard systems in
order to prevent application pods from being scheduled on controllers; since
controllers in Standard systems are intended ONLY for platform services. If
application pods MUST run on controllers, a Kubernetes toleration of the taint
can be specified in the application's pod specifications. You will also need to
change the nodeSelector / nodeAffinity to use the new label.
**Workaround**: Customer applications that need to run on controllers on
Standard systems will need to be enabled/configured for Kubernetes toleration
in order to ensure the applications continue working after an upgrade to
|prod-long| Release 8.0 and |prod-long| future Releases.
You can specify toleration for a pod through the pod specification (PodSpec).
For example:
.. code-block:: none

   spec:
     ....
     template:
       ....
       spec:
         tolerations:
           - key: "node-role.kubernetes.io/control-plane"
             operator: "Exists"
             effect: "NoSchedule"
**See**: `Taints and Tolerations <https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/>`__.
**************************************************************
Ceph alarm 800.001 interrupts the AIO-DX upgrade orchestration
**************************************************************
Upgrade orchestration fails on |AIO-DX| systems that have Ceph enabled.
**Workaround**: Clear the Ceph alarm 800.001 by manually upgrading both
controllers and then running the following command:

.. code-block:: none

   ~(keystone_admin)]$ ceph mon enable-msgr2

Ceph alarm 800.001 is cleared.
***************************************************************
Storage Nodes are not considered part of the Kubernetes cluster
***************************************************************
When running the :command:`system kube-host-upgrade-list` command, the output
only displays controller and worker hosts, which have control-plane and kubelet
components. Storage nodes do not have any of those components and so are not
considered a part of the Kubernetes cluster.
**Workaround**: Do not include Storage nodes.
***************************************************************************************
Backup and Restore of ACC100 (Mount Bryce) configuration requires double unlock attempt
***************************************************************************************
After restoring from a previous backup with an Intel ACC100 processing
accelerator device, the first unlock attempt will be refused since this
specific kind of device will be updated in the same context.
**Workaround**: A second unlock attempt after a few minutes will be accepted
and the host will unlock.
**************************************
Application Pods with SRIOV Interfaces
**************************************
Application Pods with |SRIOV| Interfaces require a **restart-on-reboot: "true"**
label in their pod spec template.
Pods with |SRIOV| interfaces may fail to start after a platform restore or
Simplex upgrade and persist in the **Container Creating** state due to missing
PCI address information in the |CNI| configuration.
**Workaround**: Application pods that require |SRIOV| should add the label
**restart-on-reboot: "true"** to their pod spec template metadata. All pods with
this label will be deleted and recreated after system initialization, therefore
all pods must be restartable and managed by a Kubernetes controller
\(i.e. DaemonSet, Deployment or StatefulSet) for auto recovery.

Pod spec template example:

.. code-block:: none

   template:
     metadata:
       labels:
         tier: node
         app: sriovdp
         restart-on-reboot: "true"
***********************
Management VLAN Failure
***********************
If the Management VLAN fails on the active System Controller, communication
failure 400.005 is detected, and alarm 280.001 is raised indicating
subclouds are offline.
**Workaround**: System Controller will recover and subclouds are manageable
when the Management VLAN is restored.
********************************
Host Unlock During Orchestration
********************************
If a host unlock during orchestration takes longer than 30 minutes to complete,
a second reboot may occur. Due to the delays, the VIM tries to abort, and the
abort operation triggers the second reboot.
**Workaround**: NA
**************************************
Storage Nodes Recovery on Power Outage
**************************************
Storage nodes take 10-15 minutes longer to recover in the event of a full
power outage.
**Workaround**: NA
*************************************
Ceph OSD Recovery on an AIO-DX System
*************************************
In certain instances a Ceph OSD may not recover on an |AIO-DX| system
\(for example, if an OSD comes up after a controller reboot and a swact
occurs), and remains in the down state when viewed using the :command:`ceph -s`
command.
**Workaround**: Manual recovery of the OSD may be required.
********************************************************
Using Helm with Container-Backed Remote CLIs and Clients
********************************************************
If **Helm** is used within Container-backed Remote CLIs and Clients:

- You will NOT see any helm installs from |prod| Platform's system
  FluxCD applications.

  **Workaround**: Do not directly use **Helm** to manage |prod| Platform's
  system FluxCD applications. Manage these applications using
  :command:`system application` commands.

- You will NOT see any helm installs from end user applications, installed
  using **Helm** on the controller's local CLI.

  **Workaround**: It is recommended that you manage your **Helm**
  applications only remotely; the controller's local CLI should only be used
  for management of the |prod| Platform infrastructure.
*********************************************************************
Remote CLI Containers Limitation for StarlingX Platform HTTPS Systems
*********************************************************************
The python2 SSL library has limitations with reference to how certificates are
validated. If you are using Remote CLI containers, due to a limitation in
the python2 SSL certificate validation, the certificate used for the 'ssl'
certificate should have either:

#. CN=IPADDRESS and SAN=empty, or
#. CN=FQDN and SAN=FQDN

**Workaround**: Use CN=FQDN and SAN=FQDN, as CN is a deprecated field in
the certificate.
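
As a hedged example, a cert-manager ``Certificate`` that satisfies the CN=FQDN
and SAN=FQDN recommendation might look like the following; the host name,
namespace, and issuer are placeholders:

.. code-block:: none

   # Illustrative only: commonName and dnsNames use the same FQDN.
   apiVersion: cert-manager.io/v1
   kind: Certificate
   metadata:
     name: ssl-cert-example
     namespace: deployment
   spec:
     secretName: ssl-cert-example
     commonName: controller.example.com
     dnsNames:
       - controller.example.com
     issuerRef:
       name: my-issuer
       kind: Issuer
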
*******************************************************************
Cert-manager does not work with uppercase letters in IPv6 addresses
*******************************************************************
Cert-manager does not work with uppercase letters in IPv6 addresses.
**Workaround**: Replace the uppercase letters in IPv6 addresses with lowercase
letters.
.. code-block:: none

   apiVersion: cert-manager.io/v1
   kind: Certificate
   metadata:
     name: oidc-auth-apps-certificate
     namespace: test
   spec:
     secretName: oidc-auth-apps-certificate
     dnsNames:
       - ahost.com
     ipAddresses:
       - fe80::903a:1c1a:e802:11e4
     issuerRef:
       name: cloudplatform-interca-issuer
       kind: Issuer
*******************************
Kubernetes Root CA Certificates
*******************************
Kubernetes does not properly support **k8s_root_ca_cert** and **k8s_root_ca_key**
being an Intermediate CA.
**Workaround**: Accept internally generated **k8s_root_ca_cert/key** or
customize only with a Root CA certificate and key.
************************
Windows Active Directory
************************
- **Limitation**: The Kubernetes API does not support uppercase IPv6 addresses.

  **Workaround**: The issuer_url IPv6 address must be specified as lowercase.

- **Limitation**: The refresh token does not work.

  **Workaround**: If the token expires, manually replace the ID token. For
  more information, see :ref:`Configure Kubernetes Client Access <configure-kubernetes-client-access>`.

- **Limitation**: TLS error logs are reported in the **oidc-dex** container
  on subclouds. These logs should not have any system impact.

  **Workaround**: NA

- **Limitation**: **stx-oidc-client** liveness probe sometimes reports
  failures. These errors may not have system impact.

  **Workaround**: NA

.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418
************
BMC Password
************
The BMC password cannot be updated.
**Workaround**: In order to update the BMC password, de-provision the BMC,
and then re-provision it with the new password.
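
A hedged sketch of this sequence is shown below; the host name and BMC
attribute values are placeholders, and the exact attribute names should be
confirmed against the :command:`system host-update` documentation.

.. code-block:: none

   # Sketch only: de-provision the BMC, then re-provision it with the new password.
   ~(keystone_admin)]$ system host-update controller-0 bm_type=none
   ~(keystone_admin)]$ system host-update controller-0 bm_type=redfish bm_ip=<bmc_ip> bm_username=<bmc_user> bm_password=<new_password>
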
****************************************
Application Fails After Host Lock/Unlock
****************************************
In some situations, an application may fail to apply after a host lock/unlock
due to previously evicted pods.
**Workaround**: Use the :command:`kubectl delete` command to delete the evicted
pods and reapply the application.
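
For example, evicted pods can be listed and removed with standard ``kubectl``
commands; the namespace, pod, and application names are placeholders:

.. code-block:: none

   # List failed/evicted pods across all namespaces, then delete the affected pod
   # and reapply the application.
   kubectl get pods --all-namespaces --field-selector=status.phase=Failed
   kubectl delete pod <pod-name> -n <namespace>
   ~(keystone_admin)]$ system application-apply <application-name>
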
***************************************
Application Apply Failure if Host Reset
***************************************
If an application apply is in progress and a host is reset, the apply will
likely fail.

**Workaround**: Once the host recovers and the system is stable, a re-apply
may be required.
********************************
Pod Recovery after a Host Reboot
********************************
On occasion, some pods may remain in an unknown state after a host is rebooted.

**Workaround**: To recover these pods, delete (kill) the pod. Also, based on
`https://github.com/kubernetes/kubernetes/issues/68211 <https://github.com/kubernetes/kubernetes/issues/68211>`__,
it is recommended that applications avoid using a subPath volume configuration.
****************************
Rare Node Not Ready Scenario
****************************
In rare cases, an instantaneous loss of communication with the active
**kube-apiserver** may result in kubernetes reporting node\(s) as stuck in the
"Not Ready" state after communication has recovered and the node is otherwise
healthy.
**Workaround**: A restart of the **kubelet** process on the affected node\(s)
will resolve the issue.
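
A minimal sketch of the recovery step, assuming **kubelet** is managed as a
systemd service on the affected node:

.. code-block:: none

   # Sketch only: restart the kubelet service on the affected node.
   sudo systemctl restart kubelet
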
*************************
Platform CPU Usage Alarms
*************************
Alarms may occur indicating platform cpu usage is \>90% if a large number of
pods are configured using liveness probes that run every second.
**Workaround**: To mitigate either reduce the frequency for the liveness
probes or increase the number of platform cores.
*******************
Pods Using isolcpus
*******************
The isolcpus feature currently does not support allocation of thread siblings
for CPU requests (i.e. physical thread + HT sibling).
**Workaround**: NA
*****************************
system host-disk-wipe command
*****************************
The :command:`system host-disk-wipe` command is not supported in this release.
**Workaround**: NA
*************************************************************
Restrictions on the Size of Persistent Volume Claims (PVCs)
*************************************************************
There is a limitation on the size of Persistent Volume Claims (PVCs) that can
be used for all StarlingX Platform Releases.
**Workaround**: It is recommended that all PVCs be a minimum size of
1GB. For more information, see, `https://bugs.launchpad.net/starlingx/+bug/1814595 <https://bugs.launchpad.net/starlingx/+bug/1814595>`__.
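
As an illustrative example, a Persistent Volume Claim that respects the
recommended minimum size (the claim name and storage class are placeholders):

.. code-block:: none

   # Illustrative only: request at least 1Gi of storage.
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: example-pvc
   spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 1Gi
     storageClassName: general
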
***************************************************************
Sub-Numa Cluster Configuration not Supported on Skylake Servers
***************************************************************
Sub-Numa cluster configuration is not supported on Skylake servers.
**Workaround**: For servers with Skylake Gold or Platinum CPUs, Sub-NUMA
clustering must be disabled in the BIOS.
*****************************************************************
The ptp-notification-demo App is Not a System-Managed Application
*****************************************************************
The ptp-notification-demo app is provided for demonstration purposes only.
Therefore, it is not supported for typical platform operations such as Backup
and Restore.
**Workaround**: NA
*************************************************************************
Deleting image tags in registry.local may delete tags under the same name
*************************************************************************
When deleting image tags in the registry.local docker registry, you should be
aware that the deletion of an **<image-name:tag-name>** will delete all tags
under the specified <image-name> that have the same 'digest' as the specified
<image-name:tag-name>. For more information, see, :ref:`Delete Image Tags in the Docker Registry <delete-image-tags-in-the-docker-registry-8e2e91d42294>`.
**Workaround**: NA
------------------
Deprecated Notices
------------------
.. All please confirm if all these have been removed from the StarlingX 9.0 Release?
****************************
Airship Armada is deprecated
****************************
.. note::

   Airship Armada is removed in stx.9.0 and replaced with FluxCD. All Armada
   based applications have to be removed before you perform an
   upgrade from |prod-long| Release 9.0 to |prod-long| Release 10.0.

.. note::

   Some application repositories may still have "armada" in the file path but
   are now supported by FluxCD. See https://opendev.org/starlingx/?sort=recentupdate&language=&q=armada.
StarlingX Release 7.0 introduces FluxCD based applications that utilize FluxCD
Helm/source controller pods deployed in the flux-helm Kubernetes namespace.
Airship Armada support is now considered to be deprecated. The Armada pod will
continue to be deployed for use with any existing Armada based applications but
will be removed in StarlingX Release 8.0, once the stx-openstack Armada
application is fully migrated to FluxCD.
************************************
Cert-manager API Version deprecation
************************************
The upgrade of cert-manager from 0.15.0 to 1.7.1 deprecated support for the
cert-manager API versions cert-manager.io/v1alpha2 and cert-manager.io/v1alpha3.
When creating cert-manager |CRDs| (certificates, issuers, etc.) with |prod-long|
Release 8.0, use cert-manager.io/v1.
***************
Kubernetes APIs
***************
Kubernetes APIs that will be removed in K8s 1.25 are listed below:
**See**: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25
--------------------------------------
Release Information for other versions
--------------------------------------
You can find details about a release on the specific release page.
.. list-table::
   :header-rows: 1

   * - Version
     - Release Date
     - Notes
     - Status
   * - StarlingX R8.0
     - 2023-02
     - https://docs.starlingx.io/r/stx.8.0/releasenotes/index.html
     - Maintained
   * - StarlingX R7.0
     - 2022-07
     - https://docs.starlingx.io/r/stx.7.0/releasenotes/index.html
     - Maintained
   * - StarlingX R6.0
     - 2021-12
     - https://docs.starlingx.io/r/stx.6.0/releasenotes/index.html
     - Maintained
   * - StarlingX R5.0.1
     - 2021-09
     - https://docs.starlingx.io/r/stx.5.0/releasenotes/index.html
     - :abbr:`EOL (End of Life)`
   * - StarlingX R5.0
     - 2021-05
     - https://docs.starlingx.io/r/stx.5.0/releasenotes/index.html
     - :abbr:`EOL (End of Life)`
   * - StarlingX R4.0
     - 2020-08
     -
     - :abbr:`EOL (End of Life)`
   * - StarlingX R3.0
     - 2019-12
     -
     - :abbr:`EOL (End of Life)`
   * - StarlingX R2.0.1
     - 2019-10
     -
     - :abbr:`EOL (End of Life)`
   * - StarlingX R2.0
     - 2019-09
     -
     - :abbr:`EOL (End of Life)`
   * - StarlingX R1.0
     - 2018-10
     -
     - :abbr:`EOL (End of Life)`

StarlingX follows the release maintenance timelines in the `StarlingX Release
Plan <https://wiki.openstack.org/wiki/StarlingX/Release_Plan#Release_Maintenance>`_.
The Status column uses `OpenStack maintenance phase <https://docs.openstack.org/project-team-guide/stable-branches.html#maintenance-phases>`_ definitions.