From fb8a2f1e3554878235cc4f2ad07bdd8e1cb71ba4 Mon Sep 17 00:00:00 2001
From: Oleg Gelbukh
Date: Thu, 5 Jan 2017 14:20:26 -0800
Subject: [PATCH] Document issues in scale testing of fuel-ccp

This document provides details of issues found during scale testing of
the fuel-ccp tool and of the containerized OpenStack control plane
installed by it. It is split into two sections, one dedicated to
Kubernetes issues and the other to OpenStack-specific problems.

Change-Id: Ided3c47f7425d5e8b5fe2bdea9e794fbb4d550ec
---
 doc/source/index.rst                       |   1 +
 doc/source/issues/index.rst                |  12 +
 doc/source/issues/scale_testing_issues.rst | 881 +++++++++++++++++++++
 3 files changed, 894 insertions(+)
 create mode 100644 doc/source/issues/index.rst
 create mode 100644 doc/source/issues/scale_testing_issues.rst

diff --git a/doc/source/index.rst b/doc/source/index.rst
index f392d11..b52ee6e 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -24,6 +24,7 @@ Contents
    labs/index.rst
    test_plans/index
    test_results/index
+   issues/index

 .. raw:: pdf

diff --git a/doc/source/issues/index.rst b/doc/source/issues/index.rst
new file mode 100644
index 0000000..d6d2cc5
--- /dev/null
+++ b/doc/source/issues/index.rst
@@ -0,0 +1,12 @@
+.. raw:: pdf
+
+   PageBreak oneColumn
+
+=======================
+Issues Analysis
+=======================
+
+.. toctree::
+   :maxdepth: 2
+
+   scale_testing_issues

diff --git a/doc/source/issues/scale_testing_issues.rst b/doc/source/issues/scale_testing_issues.rst
new file mode 100644
index 0000000..0f4b272
--- /dev/null
+++ b/doc/source/issues/scale_testing_issues.rst
@@ -0,0 +1,881 @@

.. _scale_testing_issues:

======================================
Kubernetes Issues At Scale 900 Minions
======================================

Glossary
========

- **Kubernetes** is an open-source system for automating deployment,
  scaling, and management of containerized applications.

- **fuel-ccp**: CCP stands for "Containerized Control Plane". The goal
  of this project is to make building, running and managing
  production-ready OpenStack containers on top of Kubernetes an
  easy task for operators.

- **OpenStack** is a cloud operating system that controls large pools
  of compute, storage, and networking resources throughout a
  datacenter, all managed through a dashboard that gives
  administrators control while empowering their users to provision
  resources through a web interface.

- **Heat** is an OpenStack service to orchestrate composite cloud
  applications using a declarative template format through an
  OpenStack-native REST API.

- **Slice** is a set of 6 VMs:

  - 1x Yahoo! Cloud Serving Benchmark (ycsb)

  - 1x Cassandra

  - 1x Magento

  - 1x Wordpress

  - 2x Idle VM


Setup
=====

We had 181 bare metal machines; 3 of them hosted the Kubernetes control
plane services (API servers, etcd, Kubernetes scheduler, etc.), while
each of the remaining machines ran 5 virtual machines, every one of
which was used as a Kubernetes minion node.

Each bare metal node has the following specifications:

- HP ProLiant DL380 Gen9

- **CPU** - 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

- **RAM** - 264G

- **Storage** - 3.0T on RAID on HP Smart Array P840 Controller, HDD -
  12 x HP EH0600JDYTL

- **Network** - 2x Intel Corporation Ethernet 10G 2P X710

From the Kubernetes point of view, the running OpenStack cluster is
represented by the following numbers:

1. OpenStack control plane services are running within ~80 pods on 6
   nodes

2. ~4500 pods are spread across all of the remaining nodes, 5 pods on
   each.
Kubernetes architecture analysis obstacles
==========================================

During the 900-node tests we used the `Prometheus `__
monitoring tool to verify resource consumption and the load put on the
core system, Kubernetes and OpenStack level services. During one of the
Prometheus configuration optimisations, old data was deleted from the
Prometheus storage to improve Prometheus API speed. This old data
included the 900-node cluster information, so only partial data is
available for the post-run investigation. This fact, however, does not
influence the overall reference architecture analysis, as all issues
observed during the containerized OpenStack setup testing were
thoroughly documented and debugged.

To prevent monitoring data loss in the future (Q1 2017 timeframe and
beyond) we need to proceed with the following improvements to the
monitoring setup:

1. By default, Prometheus is optimized to be used as a real-time
   monitoring / alerting system, and the official recommendation from
   the Prometheus developers team is to keep monitoring data retention
   at about 15 days so that the tool stays quick and responsive. To
   keep old data for post-usage analytics purposes, an external store
   needs to be configured.

2. We need to reconfigure the monitoring tool (Prometheus) to back its
   data up to one of the persistent time series databases (e.g.
   InfluxDB / Cassandra / OpenTSDB) that are supported as external
   persistent data stores by Prometheus. This will allow us to store
   old data for an extended amount of time for post-processing needs.

Observed issues
===============

Huge load on kube-apiserver
---------------------------

Symptoms
~~~~~~~~

Both API servers running in the Kubernetes cluster were utilising up to
2000% of CPU (up to 45% of the total node compute capacity) after we
migrated them to hardware nodes. The initial setup, with all nodes
(including the Kubernetes control plane nodes) running in a virtualized
environment, produced API servers that were not workable at all.

Root cause
~~~~~~~~~~

All services that are not placed on the Kubernetes masters (``kubelet``
and ``kube-proxy`` on all minions) access ``kube-apiserver`` via a
local ``nginx`` proxy.

Most of those requests are watch requests that stay mostly idle after
they are initiated (most of their timeouts are defined to be about 5-10
minutes). ``nginx`` was configured to cut idle connections after 3
seconds, which forced all clients to reconnect and, worst of all, to
restart the aborted SSL sessions. On the server side this made
``kube-apiserver`` consume up to 2000% CPU and other requests became
very slow.

Solution
~~~~~~~~

Set the ``proxy_timeout`` parameter to 10 minutes in the ``nginx.conf``
config file, which should be more than enough not to cut SSL
connections before the requests time out by themselves. After this fix
was applied, one API server consumed about 100% of CPU (about 2% of the
total node compute capacity) and the second one about 200% (about 4% of
the total node compute capacity), with an average response time of
200-400 ms.

Upstream issue (fixed)
~~~~~~~~~~~~~~~~~~~~~~

Make the Kargo deployment tool set ``proxy_timeout`` to 10 minutes:
`issue `__
fixed with a `pull request `__
by the Fuel CCP team.
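For illustration, a minimal sketch of the relevant ``nginx`` stream
proxy configuration with the increased timeout might look as follows
(the listen address, upstream name and ports are assumptions made for
this sketch, not the exact configuration generated by Kargo)::

    stream {
        upstream kube_apiservers {
            server 10.0.0.1:6443;
            server 10.0.0.2:6443;
        }

        server {
            listen 127.0.0.1:6443;
            proxy_pass kube_apiservers;
            # keep idle watch connections open long enough for the
            # requests to time out on their own instead of being cut
            proxy_timeout 10m;
            proxy_connect_timeout 1s;
        }
    }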
KubeDNS cannot handle big cluster load with default settings
-------------------------------------------------------------

Symptoms
~~~~~~~~

When deploying an OpenStack cluster at this scale, ``kubedns`` becomes
unresponsive because of the high load. This ends up with the following
error appearing very often in the logs of the ``dnsmasq`` container in
the ``kubedns`` pod::

    Maximum number of concurrent DNS queries reached.

Also, ``dnsmasq`` containers sometimes get restarted because they hit
their memory limit.

Root cause
~~~~~~~~~~

First of all, ``kubedns`` seems to fail often under high load (or even
without load): during the experiment we observed continuous ``kubedns``
container restarts even on an empty (but big enough) Kubernetes
cluster. The restarts are caused by the liveness check failing,
although nothing notable is observed in any logs.

Second, ``dnsmasq`` should take load off ``kubedns``, but it needs some
tuning to behave as expected under big load; otherwise it is useless.

Solution
~~~~~~~~

This requires several levels of fixing:

1. Set higher limits for the ``dnsmasq`` containers: they take on most
   of the load.

2. Add more replicas to the ``kubedns`` replication controller (we
   decided to stop at 6 replicas, as that solved the observed issue -
   for bigger clusters this number might need to be increased even
   more).

3. Increase the number of parallel connections ``dnsmasq`` should
   handle (we used ``--dns-forward-max=1000``, which is the recommended
   setting in the ``dnsmasq`` manuals).

4. Increase the cache size in ``dnsmasq``: it has a hard limit of 10000
   cache entries, which seems to be a reasonable amount.

5. Fix ``kubedns`` to handle this behaviour in a proper way.

Upstream issues (partially fixed)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Items 1 and 2 are fixed by making them configurable in Kargo by the
Kubernetes team:
`issue `__,
`pull request `__.

The other fixes are still being implemented as of the time of this
publication.

Kubernetes scheduler is ineffective with pod antiaffinity
----------------------------------------------------------

Symptoms
~~~~~~~~

It takes a significant amount of time for the scheduler to process pods
with pod antiaffinity rules specified on them. It spends about **2-3
seconds** on each pod, which makes the time needed to deploy an
OpenStack cluster on 900 nodes unexpectedly long (about 3 hours just
for scheduling). Antiaffinity rules are required for the OpenStack
deployment to prevent several OpenStack compute nodes from being mixed
together on one Kubernetes minion node; an example rule is sketched
after this section.

Root cause
~~~~~~~~~~

According to the profiling results, most of the time is spent on
creating new Selectors to match existing pods against them, which
triggers a validation step. Basically we get O(N^2) unnecessary
validation steps (N being the number of pods), even though we have just
5 deployment entities covering most of the nodes.

Solution
~~~~~~~~

A specific optimization that brings scheduling time down to about 300
ms per pod was required in this case. It is still slow in terms of
common sense (about 30 minutes spent just on scheduling pods for a
900-node OpenStack cluster), but it is close to reasonable. This
solution lowers the number of very expensive operations to O(N), which
is better, but still depends on the number of pods instead of the
number of deployments, so there is room for further improvement.

Upstream issues
~~~~~~~~~~~~~~~

The optimization was merged into master: `pull
request `__;
and backported to the 1.5 branch (to be released in 1.5.2): `pull
request `__.
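To illustrate the antiaffinity rules discussed above, a simplified
sketch of such a rule in a pod template is shown below. The label
key/value is hypothetical, not the real fuel-ccp definition, and
depending on the Kubernetes version the rule is expressed either in the
``affinity`` field of the pod spec or in the
``scheduler.alpha.kubernetes.io/affinity`` annotation::

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: nova-compute    # hypothetical pod label
          topologyKey: kubernetes.io/hostname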
Kubernetes scheduler needs to be deployed on a separate node
-------------------------------------------------------------

Symptoms
~~~~~~~~

During a huge OpenStack cluster deployment against pre-deployed
Kubernetes, the ``scheduler``, ``controller-manager`` and ``apiserver``
start competing for CPU cycles as all of them are put under a big load.
The scheduler is the most resource-hungry (see the scheduler issue
above), so we need a way to deploy it separately.

Root Cause
~~~~~~~~~~

This is the same problem with Kubernetes scheduler efficiency at the
scale of about 1000 nodes as in the issue above.

Solution
~~~~~~~~

The Kubernetes scheduler was moved to a separate node manually; all
other scheduler instances were manually killed to prevent them from
moving to other nodes.

Upstream issues
~~~~~~~~~~~~~~~

An `issue `__
was created in the Kargo installer GitHub repository.

kube-apiserver has a low default rate limit
--------------------------------------------

Symptoms
~~~~~~~~

Different services start receiving "429 Rate Limit Exceeded" HTTP
errors even though the ``kube-apiserver`` instances can take more load.
This is linked to a scheduler bug (see below).

Solution
~~~~~~~~

Raise the rate limit for the ``kube-apiserver`` process via the
``--max-requests-inflight`` option. It defaults to 400; in our case the
setup became workable at 2000. This number should be configurable in
the Kargo deployment tool, as bigger deployments might require
increasing it accordingly.

Upstream issues
~~~~~~~~~~~~~~~

No upstream issue or pull request has been created for this problem.

Kubernetes scheduler can schedule pods incorrectly
---------------------------------------------------

Symptoms
~~~~~~~~

When many pods are being created (~4500 in the case of our OpenStack
deployment) and the scheduler is faced with 429 errors from
``kube-apiserver`` (see above), it can place several pods of the same
deployment on one node, in violation of the pod antiaffinity rules on
them.

Root cause
~~~~~~~~~~

This issue arises because the scheduler cache entry is evicted before
the pod is actually processed.

Upstream issues
~~~~~~~~~~~~~~~

A `pull
request `__ was accepted
in Kubernetes upstream.

Docker becomes unresponsive at random
--------------------------------------

Symptoms
~~~~~~~~

The Docker process sometimes hangs on several nodes, which results in
timeouts in the ``kubelet`` logs, and pods cannot be spawned or
terminated successfully on the affected minion node. Although a bunch
of similar issues have been fixed in Docker since 1.11, we are still
observing these symptoms.

Workaround
~~~~~~~~~~

The Docker daemon logs do not contain any notable information, so we
had to restart the Docker service on the affected node (during these
experiments we used Docker 1.12.3, but we have observed similar
symptoms in 1.13 as well).

Calico start up time is too long
--------------------------------

Symptoms
~~~~~~~~

If we have to kill a Kubernetes node, Calico requires ~5 minutes to
re-establish all mesh connections.

Root cause
~~~~~~~~~~

Calico uses BGP, so without a route reflector it has to maintain a full
mesh of connections between all nodes in the cluster.

Solution
~~~~~~~~

We need to switch to using route reflectors in our clusters. Then every
node only needs to establish connections to the reflectors.

Upstream Issues
~~~~~~~~~~~~~~~

None. For production use, the architecture of the Calico network should
be adjusted to use route reflectors set up on selected nodes or on the
switching fabric hardware. This will reduce the number of BGP
connections per node and speed up the Calico startup.
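As a sketch of the route reflector approach suggested above, and
assuming the calicoctl v1.x resource format that was current at the
time of testing, the full node-to-node mesh could be disabled and a
global peering with a reflector configured roughly as follows (the peer
IP and AS number are placeholders)::

    # disable the full node-to-node BGP mesh
    calicoctl config set nodeToNodeMesh off

    # peer every node with a global route reflector instead
    cat << EOF | calicoctl create -f -
    apiVersion: v1
    kind: bgpPeer
    metadata:
      peerIP: 10.0.0.250
      scope: global
    spec:
      asNumber: 64567
    EOF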
===========================================
OpenStack Issues At Scale 200 Compute Nodes
===========================================

Workloads testing approach
==========================

The goal of this test was to investigate OpenStack and Kubernetes
behavior under load. This load needs to emulate end-user traffic
hitting different types of applications that run on OpenStack servers
(guest systems) in the cloud. We were interested in the following
metrics: CPU usage, disk statistics, network load, memory usage and I/O
stats on each controller node and on a chosen set of the compute nodes.

Testing preparation steps
-------------------------

To prepare for testing, several steps had to be taken:

- [0] Pre-deploy a Kubernetes + OpenStack Newton environment (Kargo +
  fuel-ccp tools)

- [1] Establish advanced monitoring of the testing environment (on
  all three layers - system, Kubernetes and OpenStack)

- [2] Prepare to automatically deploy slices against OpenStack and
  configure the applications running within them

- [3] Prepare to run workloads against the applications running within
  the slices

[1] was achieved by configuring the Prometheus monitoring and alerting
tool with collectd and Telegraf collectors. A separate monitoring
document is under preparation to present the significant effort spent
on this task.

[2] was automated by using Heat, the OpenStack orchestration service,
and its `templates `__.

[3] was automated mostly by generating HTTP traffic native to the
applications under test using the Locust.IO tool
`templates `__.
The Cassandra VM workload was automated through the Yahoo! Cloud
Serving Benchmark (ycsb) running on a neighbouring VM of the same
slice.

Step [1] was tested against the 900-node Kubernetes cluster described
in the section above and later against all test environments we had;
steps [2] and [3] were verified against a small (20-node) testing
environment in parallel with finalizing step [1]. During this
verification several issues with the recently introduced Heat support
in fuel-ccp were observed and fixed. Later all of those steps were
supposed to be run against the 200 bare metal node lab, but a bunch of
issues observed during step [2] blocked the completion of the testing.
All of those issues (including the ones found during the small
environment verification) are listed below.

Heat/fuel-ccp integration issues
================================

Lack of Heat domain configuration
---------------------------------

Symptoms
~~~~~~~~

Authentication errors during the creation of Heat stacks (representing
workload testing slices on each compute node).

Root cause
~~~~~~~~~~

In the OpenStack Newton timeframe, Orchestration started to require
additional information in the Identity service to manage stacks - in
particular, configuration of a heat domain that contains the stacks'
projects and users, creation of a ``heat_domain_admin`` user to manage
projects and users in the heat domain, and addition of the admin role
to the ``heat_domain_admin`` user in the heat domain to enable
administrative stack management privileges for that user. Please take a
look at the `OpenStack Newton configuration
guide `__
for more information.
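For reference, the missing setup corresponds roughly to the following
Identity operations and ``heat.conf`` options from the Newton
installation guide (the password value here is a placeholder)::

    openstack domain create --description "Stack projects and users" heat
    openstack user create --domain heat --password HEAT_DOMAIN_PASS heat_domain_admin
    openstack role add --domain heat --user-domain heat --user heat_domain_admin admin

    # heat.conf
    [DEFAULT]
    stack_domain_admin = heat_domain_admin
    stack_domain_admin_password = HEAT_DOMAIN_PASS
    stack_user_domain_name = heat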
Solution
~~~~~~~~

Set up the needed configuration in ``fuel-ccp``:

- `Patch #1 `__

- `Patch #2 `__

Lack of heat-api-cfn service configuration in fuel-ccp
------------------------------------------------------

Symptoms
~~~~~~~~

Applications configured in Heat templates do not receive the
configuration and data required for them to work correctly.

Root cause
~~~~~~~~~~

The initial Heat support in ``fuel-ccp`` did not include the
``heat-api-cfn`` service, which is used by Heat for some config
transports. This service is necessary to support the default
``user_data_format`` (``HEAT_CFNTOOLS``) used in most
application-related Heat templates.

Solution
~~~~~~~~

Add a new ``heat-api-cfn`` image, which is used to create the new
service, and configure all necessary endpoints for it in the
``fuel-ccp`` tool. Patches to ``fuel-ccp``:

- `Patch #1 `__

- `Patch #2 `__

Heat endpoint not reachable from virtual machines
-------------------------------------------------

Symptoms
~~~~~~~~

Heat API and Heat API CFN are not reachable from OpenStack VMs, even
though some application configurations require such access.

Root cause
~~~~~~~~~~

Fuel CCP deploys the Heat services with the default configuration and
changes ``endpoint_type`` from ``publicURL`` to ``internalURL``.
However, in a Kubernetes cluster such a configuration is not enough for
several types of Heat stack resources, such as
``OS::Heat::WaitCondition`` and ``OS::Heat::SoftwareDeployment``, which
require callbacks to the Heat API or Heat API CFN. Due to the
Kubernetes architecture, it is not possible to do such a callback on
the default port values (8004 for ``heat-api`` and 8000 for
``heat-api-cfn``).

Solution
~~~~~~~~

There are two ways to fix the issues described above:

- Out of the box, which requires just adding some data to the
  ``.ccp.yaml`` configuration file. This approach can be used prior to
  OpenStack cluster deployment, while describing the future OpenStack
  cluster.

- A second one, which requires some manual actions and can be applied
  when OpenStack is already deployed and the cloud administrator can
  change only one component's configuration.

Both of these solutions are described in the `patch to the Fuel CCP
documentation `__. Please
note that additionally a `patch to the Fuel CCP
codebase `__ needs to be
applied to make some of the Kubernetes options configurable.

Glance+Heat authentication misconfiguration
-------------------------------------------

Symptoms
~~~~~~~~

During the creation of Heat stacks (representing workload testing
slices), random Glance authentication errors were observed.

Root cause
~~~~~~~~~~

Service-specific users should have the admin role in the "service"
project and should not belong to the user-facing admin project.
Initially Fuel CCP had the Glance and Heat services misconfigured in
this respect, and several patches were required to fix it.
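For illustration, the expected layout follows the standard installation
guides, where each service user is created in the default domain and is
given the admin role only in the ``service`` project, roughly as
follows (the passwords are placeholders)::

    openstack user create --domain default --password GLANCE_PASS glance
    openstack role add --project service --user glance admin

    openstack user create --domain default --password HEAT_PASS heat
    openstack role add --project service --user heat admin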
Solution
~~~~~~~~

- `Patch #1 `__

- `Patch #2 `__

Workloads Testing Issues
========================

Random loss of connection to MySQL
----------------------------------

Symptoms
~~~~~~~~

During the creation of Heat stacks (representing slices), some of the
stacks move to the ERROR state, with the following traceback found in
the Heat logs::

    2016-12-11 16:59:22.220 1165 ERROR nova.api.openstack.extensions
    DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't
    connect to MySQL server on 'galera.ccp.svc.cluster.local' ([Errno
    -2] Name or service not known)")

Root cause
~~~~~~~~~~

This turned out to be exactly the same issue with KubeDNS being unable
to handle high loads that is described in the Kubernetes Issues section
above.

First of all, ``kubedns`` seems to fail often under high load (or even
without load): during the experiment we observed continuous ``kubedns``
container restarts even on an empty (but big enough) Kubernetes
cluster. The restarts are caused by the liveness check failing,
although nothing notable is observed in any logs.

Second, ``dnsmasq`` should take load off ``kubedns``, but it needs some
tuning to behave as expected under big load; otherwise it is useless.

Solution
~~~~~~~~

See above in the Kubernetes Issues section.

Slow Heat API requests (`Bug 1653088 `__)
-----------------------------------------------------------------------------------------

Symptoms
~~~~~~~~

Regular requests to the Heat API took more than a minute. An example of
such a request is presented in `this
traceback `__ showing the time needed to
list the details of a specific stack.

Root cause
~~~~~~~~~~

The fuel-ccp team proposed that it might be a hidden race condition
between multiple Heat workers, and it is not yet clear where this race
is happening. This is still being debugged.

Workaround
~~~~~~~~~~

Set the number of Heat workers to 1 until the real cause of this issue
is found.

OpenStack VMs cannot fetch required metadata
--------------------------------------------

Symptoms
~~~~~~~~

OpenStack servers do not receive metadata with application-specific
information. The following `trace `__ appears
in the Heat logs.

Root cause
~~~~~~~~~~

Before pushing metadata to OpenStack VMs, Heat stores information about
the stack under creation in its own database. Starting with OpenStack
Newton, Heat works in the so-called "convergence" mode by default,
which makes the Heat engine process several requests at a time. This
parallel task processing might end up in race conditions during DB
access.

Workaround
~~~~~~~~~~

Turn off the Heat convergence engine::

    convergence_engine=False

RPC timeouts during Heat resources validation
---------------------------------------------

Symptoms
~~~~~~~~

Heat resource validation fails with the following
`trace `__ being caught.

Root cause
~~~~~~~~~~

The initial assumption was that the ``rpc_response_timeout`` parameter
set in the Heat configuration file might be too small. This parameter
was set to 10 minutes, exactly as used in MOS::

    rpc_response_timeout = 600

After that was done, no more RPC timeouts were observed.

Solution
~~~~~~~~

Set ``rpc_response_timeout = 600``, as that is the value tested across
generations of MOS releases.
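Taken together, the Heat-related workarounds described in this section
boil down to a ``heat.conf`` fragment similar to the following sketch
(assuming that the "number of Heat workers" mentioned above maps to the
``num_engine_workers`` option)::

    [DEFAULT]
    # avoid the suspected race between multiple heat-engine workers
    num_engine_workers = 1
    # disable convergence mode to avoid DB races while pushing metadata
    convergence_engine = False
    # tested value carried over from MOS releases
    rpc_response_timeout = 600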
Overloaded Memcache service (`Bug 1653089 `__)
----------------------------------------------------------------------------------------------

Symptoms
~~~~~~~~

On 200 nodes with workloads being deployed, OpenStack actively uses the
cache (``memcached``). At some point in time, requests to ``memcached``
start to be processed really slowly.

Root cause
~~~~~~~~~~

The memcache size is 256 MB by default in Fuel CCP, which is a really
small size. That was the reason for the great number of retransmissions
being processed.

Solution
~~~~~~~~

Increase the cache size up to 30G in the ``fuel-ccp`` configuration
file for huge or heavily loaded deployments::

    configs:
      memcached:
        address: 0.0.0.0
        port:
          cont: 11211
        ram: 30720

Incorrect status information for Neutron agents
-----------------------------------------------

Symptoms
~~~~~~~~

When asking for the Neutron agents' status through the OpenStack
client, all of them are displayed as being in the down state, although
judging by the environment behaviour this does not seem to be true.

Root cause
~~~~~~~~~~

The root cause of this issue is hidden in an OpenStackSDK refactoring
that caused various OSC networking commands to fail. The full
discussion regarding this issue can be found in the `upstream
bug `__.

Workaround
~~~~~~~~~~

Use the Neutron client directly to gather Neutron service statuses
until the `bug `__
is fixed.

Nova client not working (`Bug 1653075 `__)
------------------------------------------------------------------------------------------

Symptoms
~~~~~~~~

All commands run through the Nova client end up with the `stack
trace `__.

Root cause
~~~~~~~~~~

In debug mode it can be seen that the client tries to use the HTTP
protocol for the Nova requests, e.g.
http://compute.ccp.external:8443/v2.1/, although the HTTPS protocol is
required for this client-server conversation.

Solution
~~~~~~~~

Add the following lines to the
``ccp-repos/fuel-ccp-nova/service/files/nova.conf.j2`` file::

    [DEFAULT]

    secure_proxy_ssl_header = HTTP_X_FORWARDED_PROTO

Neutron server timeouts (`Bug 1653077 `__)
------------------------------------------------------------------------------------------

Symptoms
~~~~~~~~

The Neutron server is not able to process the requests sent to it, with
the following error being caught in the Neutron logs:
`trace `__

Root cause
~~~~~~~~~~

The default values used in ``fuel-ccp`` for the Neutron database pool
size, as well as for the max overflow parameter, are not enough for an
environment with a big enough load on the Neutron API.

Solution
~~~~~~~~

Modify the ``ccp-repos/fuel-ccp-neutron/service/files/neutron.conf.j2``
file to contain the following configuration parameters::

    [database]

    max_pool_size = 30

    max_overflow = 60

No access to OpenStack VM from tenant network
---------------------------------------------

Symptoms
~~~~~~~~

Some of the VMs representing the slices are not reachable via the
tenant network. For example::

    | 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 |
    ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 |
    ubuntu-software-config-last |

    root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
    Connection closed by 10.144.1.35 port 22

It is unreachable from the tenant network as well, for example from
instance ``b1946719-b401-447d-8103-cc43b03b1481``, which has been
spawned by the same Heat stack on the same compute node (``node162``):
`http://paste.openstack.org/show/593486/ `__

Root cause and solution
~~~~~~~~~~~~~~~~~~~~~~~

Still under investigation.
The root cause is not clear yet. **This issue is blocking the running
of workloads against the deployed slices.**

OpenStack services don't handle PXC pseudo-deadlocks
----------------------------------------------------

Symptoms
~~~~~~~~

When run in parallel, create operations for lots of resources were
failing with a DBError saying that Percona XtraDB Cluster identified a
deadlock and the transaction should be restarted.

Root cause
~~~~~~~~~~

oslo.db is responsible for wrapping errors received from the DB into
proper classes so that services can restart transactions if such errors
occur, but it did not expect an error in the format that is sent by
Percona. After we fixed this, we still experienced similar errors,
because not all transactions that could be restarted were properly
decorated in the Nova code.

Upstream issues
~~~~~~~~~~~~~~~

The `bug `__ has been
fixed by Roman Podolyaka's
`CR `__ and
`backported `__ to Newton. It
fixes the Percona deadlock error detection, but there is at least one
place in Nova that still needs to be fixed (TBD).

Live migration failed with live_migration_uri configuration
-------------------------------------------------------------

Symptoms
~~~~~~~~

With the ``live_migration_uri`` configuration, live migration fails
because one compute host cannot connect to libvirt on another host.

Root cause
~~~~~~~~~~

We cannot specify which IP address to use in the ``live_migration_uri``
template, so it was trying to use the address from the first interface,
which happened to be in the PXE network, while libvirt listens on the
private network. We could not use ``live_migration_inbound_addr``,
which would solve this problem, because of a problem in upstream Nova.

Upstream issues
~~~~~~~~~~~~~~~

A `bug `__ in Nova has
been `fixed `__ and
`backported `__ to Newton. We
`switched `__ to using
``live_migration_inbound_addr`` after that.

Contributors
============

The following people contributed to this document:

* Dina Belova

* Yuriy Taraday