In the past few months, _most_ StarlingX services have moved
from static IP addressing to FQDN resolution, in support of
the management network reconfiguration feature.
During DC scalability testing, a transient domain resolution
issue (controller.internal) was found after adding approximately
250 subclouds to the system. The issue involved the rabbitmq/RPC
subsystem.
The error message returned was similar to:
"OSError: failed to resolve broker hostname"
The rabbitmq/amqp library calls a _connect() function,
which in turn calls Python's socket.getaddrinfo().
Multiple attempts were made to reproduce the scenario in a
non-scaled lab by stressing getaddrinfo(), driving
dnsmasq up to ~40% CPU usage, but the same error was not
returned.
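
For reference, the kind of getaddrinfo() stress described above can be
sketched as follows. This is a minimal illustration, not the script used
in the lab: the helper names (resolve_once, stress_resolve), the attempt
and worker counts, and the use of port 5672 (the default AMQP port) are
assumptions made for the example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_once(hostname, port=5672, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves, False on a resolution error."""
    try:
        resolver(hostname, port, proto=socket.IPPROTO_TCP)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def stress_resolve(hostname, attempts=1000, workers=50,
                   resolver=socket.getaddrinfo):
    """Issue many concurrent lookups and count successes and failures."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(
                lambda _: resolve_once(hostname, resolver=resolver),
                range(attempts),
            )
        )
    ok = sum(results)
    return {"ok": ok, "failed": attempts - ok}
```

Calling stress_resolve("controller.internal") on a controller returns a
dict with the number of successful and failed lookups. The resolver
argument exists only so the helper can be exercised without a live DNS
server.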
Testing was done on the DC scale lab by manually changing the
rabbit and DB config files. This confirmed that using the static
floating IP (avoiding domain name resolution altogether)
resolved the issue.
It was decided to revert the FQDN aspect of the dcmanager
and dcorch modules for now, as the management network
reconfiguration feature would not even apply to an
AIO-DX system controller at this time. This may be
re-evaluated in the future at which point a deeper dive
into the rabbit/RPC usage should be considered.
Testing:
- Install an AIO-DX system controller and install a subcloud.
  Ensure the subcloud is managed and online.
- Ensure the dcmanager.conf and dcorch.conf files use an IP
  address in their transport_url and database connection
  parameters.
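
The second check above could be automated along these lines. This is a
sketch only: it assumes the usual oslo.config layout, with transport_url
under [DEFAULT] and connection under [database], and a single host per
URL. The helper names host_is_ip and check_conf are hypothetical.

```python
import configparser
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """Return True if the host part of the URL is an IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:  # host is a name such as controller.internal
        return False


def check_conf(text):
    """Report whether each endpoint in the config text uses an IP address."""
    # interpolation=None keeps '%' characters in passwords from being
    # treated as configparser interpolation markers.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    return {
        "transport_url": host_is_ip(parser.get("DEFAULT", "transport_url")),
        "connection": host_is_ip(parser.get("database", "connection")),
    }
```

Passing the contents of dcmanager.conf or dcorch.conf to check_conf()
returns a dict in which both values should be True after this change.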
Depends-On: https://review.opendev.org/c/starlingx/config/+/932013
Story: 2010722
Task: 48447
Change-Id: Icd067441dd08321936eb03498ff65241fac0010e