
This document provides details of issues found during scale testing of the fuel-ccp tool and the containerized OpenStack control plane installed by it. It is split into two sections: one dedicated to Kubernetes issues, the other to OpenStack-specific problems.
Kubernetes Issues At Scale 900 Minions
Glossary
- Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
- fuel-ccp: CCP stands for “Containerized Control Plane”. The goal of this project is to make building, running and managing production-ready OpenStack containers on top of Kubernetes an easy task for operators.
- OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
- Heat is an OpenStack service to orchestrate composite cloud applications using a declarative template format through an OpenStack-native REST API.
- Slice is a set of 6 VMs:
- 1x Yahoo! Benchmark (ycsb)
- 1x Cassandra
- 1x Magento
- 1x Wordpress
- 2x Idle VM
Setup
We had about 181 bare metal machines; 3 of them were used to host Kubernetes control plane services (API servers, etcd, Kubernetes scheduler, etc.), while each of the remaining ones ran 5 virtual machines, with every VM used as a Kubernetes minion node.
Each bare metal node has the following specifications:
- HP ProLiant DL380 Gen9
- CPU - 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
- RAM - 264G
- Storage - 3.0T on RAID on HP Smart Array P840 Controller, HDD - 12 x HP EH0600JDYTL
- Network - 2x Intel Corporation Ethernet 10G 2P X710
The running OpenStack cluster (from the Kubernetes point of view) is represented by the following numbers:
- OpenStack control plane services are running within ~80 pods on 6 nodes
- ~4500 pods are spread across all remaining nodes, 5 pods on each.
Kubernetes architecture analysis obstacles
During the 900-node tests we used the Prometheus monitoring tool to verify resource consumption and the load put on core system, Kubernetes, and OpenStack level services. During one of the Prometheus configuration optimisations, old data was deleted from the Prometheus storage to improve Prometheus API speed; this old data included the 900-node cluster information, so only partial data is available for post-run investigation. This fact, however, does not affect the overall reference architecture analysis, as all issues observed during the containerized OpenStack setup testing were thoroughly documented and debugged.
To prevent monitoring data loss in the future (Q1 2017 timeframe and beyond) we need to make the following improvements to the monitoring setup:
- Prometheus is by default optimized to be used as a real-time monitoring / alerting system, and the official recommendation from the Prometheus development team is to keep monitoring data retention at about 15 days so that the tool stays quick and responsive. To keep older data for post-run analytics, an external store needs to be configured.
- We need to reconfigure the monitoring tool (Prometheus) to back data up to one of the persistent time series databases (e.g. InfluxDB / Cassandra / OpenTSDB) that are supported as external persistent data stores by Prometheus (see the sketch below). This will allow us to store old data for an extended amount of time for post-processing needs.
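A minimal sketch of what such a backup could look like with the flag-based InfluxDB remote storage support in Prometheus 1.x (the endpoint address, database name and retention values are illustrative assumptions, not values from our deployment):
prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.retention=360h \
  -storage.remote.influxdb-url=http://influxdb.monitoring.svc:8086/ \
  -storage.remote.influxdb.database=prometheus \
  -storage.remote.influxdb.retention-policy=autogen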
Observed issues
Huge load on kube-apiserver
Symptoms
Both API servers running in the Kubernetes cluster were utilising up to 2000% of CPU (up to 45% of total node compute capacity) after we migrated them to hardware nodes. The initial setup, with all nodes (including Kubernetes control plane nodes) running in a virtualized environment, resulted in API servers that were not workable at all.
Root cause
All services that are not placed on the Kubernetes masters (kubelet, kube-proxy on all minions) access kube-apiserver via a local nginx proxy. Most of those requests are watch requests that stay mostly idle after they are initiated (most of their timeouts are defined to be about 5-10 minutes). nginx was configured to cut idle connections after 3 seconds, which made all clients reconnect and (worst of all) restart the aborted SSL session. On the server side this made kube-apiserver consume up to 2000% of CPU, and other requests became very slow.
Solution
Set the proxy_timeout parameter to 10 minutes in the nginx.conf config file, which should be more than enough not to cut SSL connections before the requests time out by themselves. After this fix was applied, one API server consumed about 100% of CPU (about 2% of total node compute capacity) and the second one about 200% (about 4% of total node compute capacity), with an average response time of 200-400 ms.
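A sketch of the relevant nginx stream proxy section on a minion (upstream addresses and ports are illustrative, not the actual lab values):
stream {
    upstream kube_apiserver {
        server 192.168.0.1:6443;
        server 192.168.0.2:6443;
    }
    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
        proxy_connect_timeout 1s;
        # the deployment tool previously set this to 3s, cutting idle watch connections
        proxy_timeout 10m;
    }
}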
Upstream issue (fixed)
Make the Kargo deployment tool set proxy_timeout to 10 minutes: the issue was fixed with a pull request by the Fuel CCP team.
KubeDNS cannot handle big cluster load with default settings
Symptoms
When deploying an OpenStack cluster at this scale, kubedns becomes unresponsive because of high load. This ends up with the following error appearing very often in the logs of the dnsmasq container in the kubedns pod:
Maximum number of concurrent DNS queries reached.
Also, dnsmasq containers sometimes get restarted because they hit the high memory limit.
Root cause
First of all, kubedns seems to fail often under high load (or even without load); during the experiment we observed continuous kubedns container restarts even on an empty (but big enough) Kubernetes cluster. Restarts are caused by the liveness check failing, although nothing notable is observed in any logs.
Second, dnsmasq should take load off kubedns, but it needs some tuning to behave as expected under big load; otherwise it is useless.
Solution
This requires several levels of fixing (a sketch of the resulting kube-dns tuning follows the list):
- Set higher limits for dnsmasq containers: they take on most of the load.
- Add more replicas to the kubedns replication controller (we decided to stop at 6 replicas, as that solved the observed issue; for bigger clusters it might be needed to increase this number even more).
- Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the recommended parameter setup in dnsmasq manuals).
- Increase the size of the cache in dnsmasq: it has a hard limit of 10000 cache entries, which seems to be a reasonable amount.
- Fix kubedns to handle this behaviour in a proper way.
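As an illustration only, this is roughly how the tuning above might look in the kube-dns manifest of that period (the replica count and dnsmasq flags are our settings, while the container name, resource values and exact fields follow the upstream kube-dns addon and are assumptions here):
spec:
  replicas: 6                       # more kubedns replicas to spread the load
  template:
    spec:
      containers:
      - name: dnsmasq
        args:
        - --dns-forward-max=1000    # allow more concurrent forwarded queries
        - --cache-size=10000        # dnsmasq's hard upper limit on cache entries
        resources:
          limits:
            cpu: 300m               # illustrative: higher limits than the defaults
            memory: 256Mi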
Upstream issues (partially fixed)
Fixes #1 and #2 were implemented by making these settings configurable in Kargo by the Kubernetes team: issue, pull request. The other fixes were still being implemented at the time of this publication.
Kubernetes scheduler is ineffective with pod antiaffinity
Symptoms
It takes a significant amount of time for the scheduler to process pods that have pod antiaffinity rules specified on them. It spends about 2-3 seconds on each pod, which makes the time needed to deploy an OpenStack cluster on 900 nodes unexpectedly long (about 3h just for scheduling). Antiaffinity rules are required for the OpenStack deployment to prevent several OpenStack compute nodes from being placed together on one Kubernetes minion node.
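For reference, the antiaffinity rules in question have roughly the following form (shown with the spec.affinity field syntax of later Kubernetes releases; in the release used for the tests this was expressed through an alpha annotation, and the label is illustrative):
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: nova-compute                  # illustrative label
      topologyKey: kubernetes.io/hostname    # at most one such pod per node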
Root cause
According to profiling results, most of the time is spent creating new Selectors to match existing pods against, which triggers a validation step. Basically we get O(N^2) unnecessary validation steps (N being the number of pods), even though we have just 5 deployment entities covering most of the nodes.
Solution
A specific optimization that speeds up scheduling to about 300 ms/pod was required in this case. It is still slow in terms of common sense (about 30 minutes spent just on pod scheduling for a 900-node OpenStack cluster), but it is close to reasonable. This solution lowers the number of very expensive operations to O(N), which is better, but it still depends on the number of pods rather than the number of deployments, so there is room for future improvement.
Upstream issues
The optimization was merged into master: pull request; it was backported to the 1.5 branch (to be released in 1.5.2): pull request.
Kubernetes scheduler needs to be deployed on separate node
Symptoms
During a huge OpenStack cluster deployment against pre-deployed Kubernetes, the scheduler, controller-manager and apiserver start competing for CPU cycles as all of them get big load. The scheduler is more resource-hungry (see the next problem), so we need a way to deploy it separately.
Root Cause
The same problem with Kubernetes scheduler efficiency at the scale of about 1000 nodes as in the issue above.
Solution
The Kubernetes scheduler was moved to a separate node manually; all other scheduler instances were manually killed to prevent them from moving to other nodes.
Upstream issues
An issue was created in the Kargo installer GitHub repository.
kube-apiserver has a low default rate limit
Symptoms
Different services start receiving "429 Rate Limit Exceeded" HTTP errors even though the kube-apiservers can take more load. This is linked to a scheduler bug (see below).
Solution
Raise the rate limit for the kube-apiserver process via the --max-requests-inflight option. It defaults to 400; in our case it became workable at 2000. This number should be configurable in the Kargo deployment tool, as for bigger deployments it might be required to increase it further.
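The corresponding kube-apiserver flag, sketched (all other flags stay as configured by the deployment tool):
kube-apiserver --max-requests-inflight=2000   # default is 400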
Upstream issues
No upstream issue or pull request was created for this issue.
Kubernetes scheduler can schedule wrongly
Symptoms
When many pods are being created (~4500 in our case of the OpenStack deployment) and are faced with 429 errors from kube-apiserver (see above), the scheduler can schedule several pods of the same deployment on one node, in violation of the pod antiaffinity rule set on them.
Root cause
This issue arises due to the scheduler cache being evicted before the pod is actually processed.
Upstream issues
A pull request was accepted in Kubernetes upstream.
Docker becomes unresponsive at random
Symptoms
The Docker process sometimes hangs on several nodes, which results in timeouts in kubelet logs, and pods cannot be spawned or terminated successfully on the affected minion node. Although a bunch of similar issues has been fixed in Docker since 1.11, we are still observing these symptoms.
Workaround
Docker daemon logs do not contain any notable information, so we had to restart the docker service on the affected node (during these experiments we used Docker 1.12.3, but we have observed similar symptoms with 1.13 as well).
Calico start up time is too long
Symptoms
If we have to kill a Kubernetes node, Calico requires ~5 minutes to reestablish all mesh connections.
Root cause
Calico uses BGP, so without route reflectors it has to maintain a full mesh between all nodes in the cluster.
Solution
We need to switch to using route reflectors in our clusters. Then every node only needs to establish connections to the reflectors.
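A rough sketch of what this change amounts to, using the calicoctl resource format of that period (the peer address and AS number are illustrative, and the exact commands depend on the calicoctl version, so treat this purely as an assumption-based example):
# disable the full node-to-node BGP mesh
calicoctl config set nodeToNodeMesh off
# declare a global BGP peer (the route reflector) that every node will connect to
apiVersion: v1
kind: bgpPeer
metadata:
  peerIP: 192.168.100.1
  scope: global
spec:
  asNumber: 64567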
Upstream Issues
None. For production use, the architecture of the Calico network should be adjusted to use route reflectors set up on selected nodes or on switching fabric hardware. This will reduce the number of BGP connections per node and speed up Calico startup.
OpenStack Issues At Scale 200 Compute Nodes
Workloads testing approach
The goal of this test was to investigate OpenStack and Kubernetes behavior under load. This load needs to emulate end-user traffic hitting OpenStack servers (guest systems) with different types of applications running against the cloud system. We were interested in the following metrics: CPU usage, disk statistics, network load, memory usage and IO stats on each controller node and on a chosen set of compute nodes.
Testing preparation steps
To prepare for testing, several steps had to be taken:
- [0] Pre-deploy Kubernetes + OpenStack Newton environment (Kargo + fuel-ccp tools)
- [1] Establish advanced monitoring of the testing environment (on all three layers - system, Kubernetes and OpenStack)
- [2] Prepare to automatically deploy slices against OpenStack and configure applications running within them
- [3] Prepare to run workloads against applications running within slices
[1] was achieved by configuring the Prometheus monitoring and alerting tool with Collectd and Telegraf collectors. A separate monitoring document is under preparation to present the significant effort spent on this task.
[2] was automated by using Heat OpenStack orchestration service templates.
[3] was automated mostly by generating HTTP traffic native to the applications under test using Locust.IO tool templates. The Cassandra VM workload was driven by the Yahoo! Benchmark (ycsb) running on a neighbour VM of the same slice.
Step [1] was tested against the 900-node Kubernetes cluster described in the section above and later against all test environments we had; steps [2] and [3] were verified against a small (20-node) testing environment in parallel with finalizing step [1]. During this verification several issues with the recently introduced Heat support in fuel-ccp were observed and fixed. Later all of those steps were supposed to be run against a 200 bare metal node lab, but a bunch of issues observed during step [2] blocked finishing the testing. All of those issues (including those found during the small-environment verification) are listed below.
Heat/fuel-ccp integration issues
Lack of Heat domain configuration
Symptoms
Authentication errors during creation of Heat stacks (representing the workload testing slices on each compute node).
Root cause
Starting with the OpenStack Newton timeframe, Orchestration requires additional information in the Identity service to manage stacks: in particular, configuration of a heat domain that contains the stacks' projects and users, creation of a heat_domain_admin user to manage projects and users in the heat domain, and adding the admin role to the heat_domain_admin user in the heat domain to enable administrative stack-management privileges for that user. Please take a look at the OpenStack Newton configuration guide for more information.
Solution
Set up the needed configuration steps in fuel-ccp:
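For reference, these steps amount to the standard Newton-era Heat domain setup, roughly as follows (the password value is a placeholder; the exact way fuel-ccp renders this may differ):
openstack domain create --description "Stack projects and users" heat
openstack user create --domain heat --password-prompt heat_domain_admin
openstack role add --domain heat --user-domain heat --user heat_domain_admin admin
together with the matching heat.conf fragment:
[DEFAULT]
stack_user_domain_name = heat
stack_domain_admin = heat_domain_admin
stack_domain_admin_password = <password>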
Lack of heat-api-cfn service configuration in fuel-ccp
Symptoms
Applications configured in Heat templates do not receive their configurations and the data required for the applications to work.
Root cause
Initial Heat support in fuel-ccp did not include the heat-api-cfn service, which is used by Heat for some config transports. This service is necessary to support the default user_data_format (HEAT_CFNTOOLS), used for most application-related Heat templates.
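For context, a minimal sketch of a Heat template fragment that relies on this config transport (image, flavor and file names are placeholders):
resources:
  app_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-16.04
      flavor: m1.small
      user_data_format: HEAT_CFNTOOLS     # the default; needs heat-api-cfn to deliver data
      user_data: { get_file: install_app.sh }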
Solution
Add a new heat-api-cfn image, which will be used to create the new service, and configure all necessary endpoints for it in the fuel-ccp tool. Patches to fuel-ccp:
Heat endpoint not reachable from virtual machines
Symptoms
Heat API and Heat API CFN are not reachable from OpenStack VMs, even though some of the application configurations require such access.
Root cause
Fuel CCP deploys Heat services with the default configuration and changes endpoint_type from publicURL to internalURL. However, such a configuration in a Kubernetes cluster is not enough for several types of Heat stack resources like OS::Heat::WaitCondition and OS::Heat::SoftwareDeployment, which require callbacks to Heat API or Heat API CFN. Due to the Kubernetes architecture, it is not possible to do such a callback on the default port value (8004 for heat-api and 8000 for heat-api-cfn).
Solution
There are two ways to fix the issues described above:
- The out-of-the-box way, which requires just adding some data to the .ccp.yaml configuration file. This workaround can be applied before OpenStack cluster deployment, while describing the future OpenStack cluster.
- The second way, which requires some manual actions and can be used when OpenStack is already deployed and the cloud administrator can change only one component's configuration.
Both of those solutions are described in the patch to the Fuel CCP documentation. Please note that an additional patch to the Fuel CCP codebase needs to be applied to make some of the Kubernetes options configurable.
Glance+Heat authentication misconfiguration
Symptoms
During creation of Heat stacks (representing the workload testing slices), random Glance authentication errors were observed.
Root cause
Service-specific users should have the admin role in the "service" project and should not belong to the user-facing admin project. Initially Fuel CCP had the Glance and Heat services misconfigured, and several patches were required to fix this.
Solution
Workloads Testing Issues
Random loss of connection to MySQL
Symptoms
During creation of Heat stacks (representing slices), some of the stacks move to the ERROR state with the following traceback found in the Heat logs:
2016-12-11 16:59:22.220 1165 ERROR nova.api.openstack.extensions
DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't
connect to MySQL server on 'galera.ccp.svc.cluster.local' ([Errno
-2] Name or service not known)")
Root cause
This turned out to be exactly the same issue with KubeDNS being unable to handle high load as described in the Kubernetes Issues section above.
Solution
See above in Kubernetes Issues section.
Slow Heat API requests (Bug 1653088)
Symptoms
Regular requests to the Heat API took more than a minute. An example can be seen in this traceback, which shows the time needed to list details of a specific stack.
Root cause
The fuel-ccp team proposed that it might be a hidden race condition between multiple Heat workers, but it is not yet clear where this race is happening. This is still being debugged.
Workaround
Set the number of Heat workers to 1 until the real cause of this issue is found.
OpenStack VMs cannot fetch required metadata
Symptoms
OpenStack servers do not receive metadata with application-specific information. The following trace appears in the Heat logs.
Root cause
Before pushing metadata to OpenStack VMs, Heat stores information about the stack under creation in its own database. Starting with OpenStack Newton, Heat works in the so-called "convergence" mode by default, which makes the Heat engine process several requests at one time. This parallel task processing might end up with race conditions during DB access.
Workaround
Turn off the Heat convergence engine:
convergence_engine=False
RPC timeouts during Heat resources validation
Symptoms
During Heat resources validation, the process fails with the following trace being caught.
Root cause
The initial assumption was that the rpc_response_timeout parameter in the Heat configuration file might be set too low. This parameter was therefore set to 10 minutes, exactly as used in MOS:
rpc_response_timeout = 600
After that was done, no more RPC timeouts were observed.
Solution
Set rpc_response_timeout = 600, as that is the value tested across generations of MOS releases.
Overloaded Memcache service (Bug 1653089)
Symptoms
On 200 nodes with workloads, OpenStack actively uses the cache (memcached). At some point in time, requests to memcached start to be processed really slowly.
Root cause
The Memcache size is 256 MB by default in Fuel CCP, which is really small. That was the reason for the great number of retransmissions being processed.
Solution
Increase the cache size up to 30G in the fuel-ccp configuration file for huge or loaded deployments:
configs:
  memcached:
    address: 0.0.0.0
    port:
      cont: 11211
    ram: 30720
Incorrect status information for Neutron agents
Symptoms
When asking for Neutron agent statuses through the OpenStack client, all of them are displayed in the down state, although, judging by the environment behaviour, this does not seem to be true.
Root cause
The root cause of this issue lies in an OpenStackSDK refactoring that caused various OSC networking commands to fail. The full discussion of this issue can be found in the upstream bug.
Workaround
Use the Neutron client directly to gather Neutron service statuses until the bug is fixed.
Nova client not working (Bug 1653075)
Symptoms
All commands run through the Nova client end up with a stack trace.
Root cause
In debug mode it can be seen that the client tries to use the HTTP protocol for Nova requests, e.g. http://compute.ccp.external:8443/v2.1/, although the HTTPS protocol is required for this client-server conversation.
Solution
Add the following lines to the ccp-repos/fuel-ccp-nova/service/files/nova.conf.j2 file:
[DEFAULT]
secure_proxy_ssl_header = HTTP_X_FORWARDED_PROTO
Neutron server timeouts (Bug 1653077)
Symptoms
The Neutron server is not able to process requests, with the following error being caught in the Neutron logs: trace
Root cause
The default values (used in fuel-ccp) for the Neutron database pool size, as well as the max overflow parameter, are not enough for an environment with a big enough load on the Neutron API.
Solution
Modify the ccp-repos/fuel-ccp-neutron/service/files/neutron.conf.j2 file to contain the following configuration parameters:
[database]
max_pool_size = 30
max_overflow = 60
No access to OpenStack VM from tenant network
Symptoms
Some of the VMs representing the slices are not reachable via the tenant network. For example:
| 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 | ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 | ubuntu-software-config-last |
root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
Connection closed by 10.144.1.35 port 22
It is unreachable from the tenant network as well, for example from instance b1946719-b401-447d-8103-cc43b03b1481, which has been spawned by the same Heat stack on the same compute node (node162): http://paste.openstack.org/show/593486/
Root cause and solution
Still under investigation; the root cause is not clear yet. This issue blocks running workloads against the deployed slices.
OpenStack services don’t handle PXC pseudo-deadlocks
Symptoms
When run in parallel, create operations for lots of resources were failing with a DBError saying that Percona XtraDB Cluster identified a deadlock and the transaction should be restarted.
Root cause
oslo.db is responsible for wrapping errors received from the DB into proper classes so that services can restart transactions when such errors occur, but it did not expect the error in the format that Percona sends. After we fixed this, we still experienced similar errors, because not all transactions that could be restarted were properly decorated in the Nova code.
Upstream issues
The bug has been fixed by Roman Podolyaka's CR and backported to Newton. It fixes Percona deadlock error detection, but there is at least one place in Nova still to be fixed (TBD).
Live migration failed with live_migration_uri configuration
Symptoms
With the live_migration_uri configuration, live migrations fail because one compute host cannot connect to libvirt on another host.
Root cause
We cannot specify which IP address to use in the live_migration_uri template, so it was trying to use the address from the first interface, which happened to be in the PXE network, while libvirt listens on the private network. We could not use live_migration_inbound_addr, which would solve this problem, because of a problem in upstream Nova.
Upstream issues
A bug in Nova has been fixed and backported to Newton. We switched to using live_migration_inbound_addr after that.
Contributors
The following people have credits for contributing to this document:
- Dina Belova <dbelova@mirantis.com>
- Yuriy Taraday <ytaraday@mirantis.com>