Document issues in scale testing of fuel-ccp

This document provides details of issues found in scale testing of fuel-ccp tool and OpenStack containerized control plane installed by it. It is split into two sections, one dedicated to Kubernetes issues, and the other to OpenStack-specific problems. Change-Id: Ided3c47f7425d5e8b5fe2bdea9e794fbb4d550ec
2017-01-05 14:20:26 -08:00 · 2017-01-05 14:20:26 -08:00 · fb8a2f1e35
commit fb8a2f1e35
parent b37458301a
3 changed files with 894 additions and 0 deletions
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@ -24,6 +24,7 @@ Contents
    labs/index.rst
    test_plans/index
    test_results/index
    issues/index
 .. raw:: pdf
--- a/doc/source/issues/index.rst
+++ b/doc/source/issues/index.rst
@ -0,0 +1,12 @@
 .. raw:: pdf
    PageBreak oneColumn
 =======================
 Issues Analysis
 =======================
 .. toctree::
    :maxdepth: 2
    scale_testing_issues
--- a/doc/source/issues/scale_testing_issues.rst
+++ b/doc/source/issues/scale_testing_issues.rst
@ -0,0 +1,881 @@
 .. _scale_testing_issues:
 ======================================
 Kubernetes Issues At Scale 900 Minions
 ======================================
 Glossary
 ========
 -  **Kubernetes** is an open-source system for automating deployment,
   scaling, and management of containerized applications.
 -  **fuel-ccp**: CCP stands for “Containerized Control Plane”. The goal
   of this project is to make building, running and managing
   production-ready OpenStack containers on top of Kubernetes an
   easy task for operators.
 -  **OpenStack** is a cloud operating system that controls large pools
   of compute, storage, and networking resources throughout a
   datacenter, all managed through a dashboard that gives
   administrators control while empowering their users to provision
   resources through a web interface.
 -  **Heat** is an OpenStack service to orchestrate composite cloud
   applications using a declarative template format through an
   OpenStack-native REST API.
 -  **Slice** is a set of 6 VMs:
   -  1x Yahoo! Benchmark (ycsb)
   -  1x Cassandra
   -  1x Magento
   -  1x Wordpress
   -  2x Idle VM
 Setup
 =====
 We had about 181 bare metal machines, 3 of them were used for Kubernetes
 control plane services placement (API servers, ETCD, Kubernetes
 scheduler, etc.), others had 5 virtual machines on each node, where
 every VM was used as a Kubernetes minion node.
 Each bare metal node has the following specifications:
 -  HP ProLiant DL380 Gen9
 -  **CPU** - 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
 -  **RAM** - 264G
 -  **Storage** - 3.0T on RAID on HP Smart Array P840 Controller, HDD -
   12 x HP EH0600JDYTL
 -  **Network** - 2x Intel Corporation Ethernet 10G 2P X710
 Running OpenStack cluster (from Kubernetes point of view) is represented
 with the following numbers:
 1. OpenStack control plane services are running within ~80 pods on 6
   nodes
 2. ~4500 pods are spread across all remaining nodes, 5 pods on each.
 Kubernetes architecture analysis obstacles
 ==========================================
 During the 900 nodes tests we used `Prometheus <https://prometheus.io/>`__
 monitoring tool for the
 verification of the resources consumption and the load put on core
 system, Kubernetes and OpenStack levels services. During one of the
 Prometheus configuration optimisations old data from the Prometheus
 storage was deleted to improve Prometheus API speed, and this old data
 included 900 nodes cluster information, therefore we have only partial
 data being available for the post run investigation. This fact,
 although, does not influence overall reference architecture analysis, as
 all issues, that were observed during the containerized OpenStack setup
 testing, were thoughtfully documented and debugged.
 To prevent monitoring data loss in future (Q1 2017 timeframe and
 further) we need to proceed with the following improvements of the
 monitoring setup:
 1. Prometheus by default is more optimized to be used as real time
   monitoring / alerting system, and there is an official
   recommendation from Prometheus developers team to keep monitoring
   data retention for about 15 days to keep tool working in quick
   and responsive manner. To keep old data for the post-usage
   analytics purposes external store requires to be configured.
 2. We need to reconfigure monitoring tool (Prometheus) to include data
   backup to one of the persistent time series databases (e.g.
   InfluxDB / Cassandra / OpenTSDB) that’s supported as an external
   persistent data store by Prometheus. This will allow us to store
   old data for extended amount of time for post-processing needs.
 Observed issues
 ===============
 Huge load on kube-apiserver
 ---------------------------
 Symptoms
 ~~~~~~~~
 Both API servers, running in Kubernetes cluster, were utilising up to
 2000% of CPU (up to 45% of total node compute performance capacity)
 after we migrated them to hardware nodes. Initial setup with all nodes
 (including Kubernetes control plane nodes) running on virtualized
 environment was showing not workable API servers at all.
 Root cause
 ~~~~~~~~~~
 All services that are placed not on Kubernetes masters (``kubelet``,
 ``kube-proxy`` on all minions) access ``kube-apiserver`` via local
 ``ngnix`` proxy.
 Most of those requests are watch requests that stay mostly idle after
 they are initiated (most timeouts on them are defined to be about 5-10
 minutes). ``nginx`` was configured to cut idle connections in 3 seconds,
 which makes all clients to reconnect and (the worst) restart aborted SSL
 session. On the server side it makes ``kube-apiserver`` consume up to 2000%
 CPU resources and other requests become very slow.
 Solution
 ~~~~~~~~
 Set ``proxy_timeout`` parameter to 10 minutes in ``nginx.conf`` config
 file, which should be more than enough not to cut SSL connections before
 requests time out by themselves. After this fix was applied, one
 api-server became to consume 100% of CPU (about 2% of total node compute
 performance capacity), the second one about 200% (about 4% of total node
 compute performance capacity) of CPU (with average response time 200-400
 ms).
 Upstream issue (fixed)
 ~~~~~~~~~~~~~~~~~~~~~~
 Make Kargo deployment tool set ``proxy_timeout`` to 10 minutes:
 `issue <https://github.com/kubernetes-incubator/kargo/issues/655>`__
 fixed with `pull request <https://github.com/kubernetes-incubator/kargo/pull/656>`__
 by Fuel CCP team.
 KubeDNS cannot handle big cluster load with default settings
 ------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 When deploying OpenStack cluster on this scale, ``kubedns`` becomes
 unresponsive because of high load. This end up with very often error
 appearing in logs of ``dnsmasq`` container in ``kubedns`` pod::
    Maximum number of concurrent DNS queries reached.
 Also ``dnsmasq`` containers sometimes get restarted due to hitting high
 memory limit.
 Root cause
 ~~~~~~~~~~
 First of all, ``kubedns`` seems to fail often on high load (or even without
 load), during the experiment we observed continuous kubedns container
 restarts even on empty (but big enough) Kubernetes cluster. Restarts
 are caused by liveness check failing, although nothing notable is
 observed in any logs.
 Second, ``dnsmasq`` should have taken load off ``kubedns``, but it needs some
 tuning to behave as expected for big load, otherwise it is useless.
 Solution
 ~~~~~~~~
 This requires several levels of fixing:
 1. Set higher limits for ``dnsmasq`` containers: they take on most of the
   load.
 2. Add more replicas to ``kubedns`` replication controller (we decided to
   stop on 6 replicas, as it solved the observed issue - for bigger
   clusters it might be needed to increase this number even more).
 3. Increase number of parallel connections ``dnsmasq`` should handle (we
   used ``--dns-forward-max=1000`` which is recommendaed parameter setup
   in ``dnsmasq`` manuals)
 4. Increase size of cache in ``dnsmasq``: it has hard limit of 10000 cache
   entries which seems to be reasonable amount.
 5. Fix ``kubedns`` to handle this behaviour in proper way.
 Upstream issues (partially fixed)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #1 and #2 are fixed by making them configurable in Kargo by Kubernetes
 team:
 `issue <https://github.com/kubernetes-incubator/kargo/issues/643>`__,
 `pull request <https://github.com/kubernetes-incubator/kargo/pull/652>`__.
 Other fixes are still being implemented as of time of this publication.
 Kubernetes scheduler is ineffective with pod antiaffinity
 ---------------------------------------------------------
 Symptoms
 ~~~~~~~~
 It takes significant amount of time for scheduler to process pods with
 pod antiaffinity rules specified on them. It is spending about **2-3
 seconds** on each pod which makes time needed to deploy OpenStack
 cluster on 900 nodes unexpectedly long (about 3h for just scheduling).
 Antiaffinity rules are required to be used for OpenStack deployment to
 prevent several OpenStack compute nodes to be mixed and messed to one
 Kubernetes Minion node.
 Root cause
 ~~~~~~~~~~
 According to profiling results, most of the time is spent on creating
 new Selectors to match existing pods against them, which triggers
 validation step. Basically we have O(N^2) unnecessary validation steps
 (N - number of pods), even if we have just 5 deployments entities
 covering most of the nodes.
 Solution
 ~~~~~~~~
 Specific optimization that speeds up scheduling time up to about 300
 ms/pod was required in this case. It’s still slow in terms of common
 sense (about 30m spent just on pods scheduling for 900 nodes OpenStack
 cluster), but is close to be reasonable. This solution lowers number of
 very expensive operations to O(N), which is better, but still depends on
 number of pods instead of deployments, so there is space for future
 improvement.
 Upstream issues
 ~~~~~~~~~~~~~~~
 Optimization merged into master: `pull
 request <https://github.com/kubernetes/kubernetes/pull/37691>`__;
 backported to 1.5 branch (will release in 1.5.2 release): `pull
 request <https://github.com/kubernetes/kubernetes/pull/38693>`__.
 Kubernetes scheduler needs to be deployed on separate node
 ----------------------------------------------------------
 Symptoms
 ~~~~~~~~
 During huge OpenStack cluster deployment against pre-deployed
 Kubernetes ``scheduler``, ``controller-manager`` and ``apiserver`` start
 competing for CPU cycles as all of them get big load. Scheduler is more
 resource-hungry (see next problem), so we need a way to deploy it
 separately.
 Root Cause
 ~~~~~~~~~~
 The same problem with Kubsernetes scheduler efficiency at scale of about
 1000 nodes as in the issue above.
 Solution
 ~~~~~~~~
 Kubernetes scheduler was moved to a separate node manually, all other
 schedulers were manually killed to prevent them from moving to other
 nodes.
 Upstream issues
 ~~~~~~~~~~~~~~~
 `Issue <https://github.com/kubernetes-incubator/kargo/issues/834>`__
 created in Kargo installer Github repository.
 kube-apiserver have low default rate limit
 ------------------------------------------
 Symptoms
 ~~~~~~~~
 Different services start receiving “429 Rate Limit Exceeded” HTTP error
 even though ``kube-apiservers`` can take more load. It is linked to a
 scheduler bug (see below).
 Solution
 ~~~~~~~~
 Raise rate limit for ``kube-apiserver process`` via ``--max-requests-inflight``
 option. It defaults to 400, in our case it became workable at 2000. This
 number should be configurable in Kargo deployment tool, as for bigger
 deployments it might be required to increase it accordingly.
 Upstream issues
 ~~~~~~~~~~~~~~~
 Upstream issue or pull request was not created for this issue.
 Kubernetes scheduler can schedule wrongly
 -----------------------------------------
 Symptoms
 ~~~~~~~~
 When many pods are being created (~4500 in our case of OpenStack
 deployment) and faced with 429 error from ``kube-apiserver`` (see above),
 the scheduler can schedule several pods of the same deployment on one node
 in violation of pod antiaffinity rule on them.
 Root cause
 ~~~~~~~~~~
 This issue arises due to scheduler cache being evicted before the pod
 actually processed.
 Upstream issues
 ~~~~~~~~~~~~~~~
 `Pull
 request <https://github.com/kubernetes/kubernetes/pull/38503>`__ accepted
 in Kubernetes upstream.
 Docker become unresponsive at random
 ------------------------------------
 Symptoms
 ~~~~~~~~
 Docker process sometimes hangs on several nodes, which results in
 timeouts in ``kubelet`` logs and pods cannot be spawned or terminated
 successfully on the affected minion node. Although bunch of similar
 issues has been fixed in Docker since 1.11, we still are observing those
 symptoms.
 Workaround
 ~~~~~~~~~~
 Docker daemon logs does not contain any notable information, so we had
 to restart docker service on the affected node (during those experiments
 we used Docker 1.12.3, but we have observed similar symptoms in 1.13
 as well).
 Calico start up time is too long
 --------------------------------
 Symptoms
 ~~~~~~~~
 If we have to kill a Kubernetes node, Calico requires ~5 minutes to
 reestablish all mesh connections.
 Root cause
 ~~~~~~~~~~
 Calico uses BGP, so without route reflector it has to do full-mesh
 between all nodes in cluster.
 Solution
 ~~~~~~~~
 We need to switch to using route reflectors in our clusters. Then every
 node needs only to establish connections to all reflectors.
 Upstream Issues
 ~~~~~~~~~~~~~~~
 None. For production use, architecutre of Calico network should be
 adjusted to use route reflectors set up on selected nodes or on
 switching fabric hardware. This will reduce the number of BGP
 connections per node and speed up the Calico startup.
 ===========================================
 OpenStack Issues At Scale 200 Compute Nodes
 ===========================================
 Workloads testing approach
 ==========================
 The goal of this test was to investigate OpenStack and Kubernetes
 behavior under load. This load needs to emulate end-user traffic running
 on OpenStack servers (guest systems) with different types of
 applications running against cloud system. We were interested in the
 following metrics: CPU usage, Disk statistics, Network load, Memory used
 and IO stats on each controller nodes and chosen set of the compute
 nodes.
 Testing preparation steps
 -------------------------
 To preare for testing, several steps should be made:
 -  [0] Pre-deploy Kubernetes + OpenStack Newton environment (Kargo +
   fuel-ccp tools)
 -  [1] Establish advanced monitoring of the testing environment (on
   all three layers - system, Kubernetes and OpenStack)
 -  [2] Prepare to automatically deploy slices against OpenStack and
   configure applications running within them
 -  [3] Prepare to run workloads against applications running within
   slices
 [1] was achieved through configuring Prometheus monitoring and alerting
 tool with Collectd and Telegraf collectors. Separated monitoring
 document is under preparation to present significant effort spent on
 this task.
 [2] was automated by using Heat OpenStack orchestration service
 `templates <https://github.com/ayasakov/hot-ansible-templates>`__.
 [3] was automated mostly through generating HTTP trafic native for the
 applications under test using Locust.IO tool
 `templates <https://github.com/ayasakov/locustio-workloads>`__.
 Cassandra VM workload was automated through Yahoo! Benchmark (ycsb)
 running on neighbour VM of the same slice.
 Step [1] was tested against 900 nodes Kubernetes cluster described in
 section above and later against all test environments we had, steps [2] and
 [3] were verified against small (20 nodes) testing environment in
 parallel with finalizing step [1] workability. During this verification
 several issues with recently introduced Heat support to fuel-ccp were
 observed and fixed. Later all of those steps were assumed to be run
 against 200 bare metal nodes lab, but bunch of issues was observed
 during step [2] that blocked finishing the testing. All of those issues
 (including those found during small environment verification) are listed
 below.
 Heat/fuel-ccp integration issues
 ================================
 Lack of Heat domain configuration
 ---------------------------------
 Symptoms
 ~~~~~~~~
 Authentication errors during Heat stacks (representing workloads testing
 slices on each compute node) creation.
 Root cause
 ~~~~~~~~~~
 During OpenStack Newton timeframe Orchestration has started to require
 additional information in the Identity service to manage stacks - in
 particular, configuration of heat domain that would contain stacks
 projects and users, creating of ``heat_domain_admin`` user to manage
 projects and users in the heat domain and adding the admin role to the
 ``heat_domain_admin`` user in the heat domain to enable administrative
 stack management privileges by the ``heat_domain_admin`` user. Please take
 a look on `OpenStack Newton configuration
 guide <http://docs.openstack.org/project-install-guide/orchestration/newton/install-ubuntu.html>`__
 for more information.
 Solution
 ~~~~~~~~
 Set up needed configuration steps in ``fuel-ccp``:
 -  `Patch #1 <https://review.openstack.org/#/c/400846/>`__
 -  `Patch #2 <https://review.openstack.org/#/c/402045/>`__
 Lack of heat-api-cfn service configuration in fuel-ccp
 ------------------------------------------------------
 Symptoms
 ~~~~~~~~
 Applications configured in Heat templates are not receiving their
 configurations and data required for the succeeded applications
 workability.
 Root cause
 ~~~~~~~~~~
 Initial Heat support in ``fuel-ccp`` did not include ``heat-api-cfn``
 service, which is used by Heat for some config transports. This service is
 necessary to support default ``user_data_format`` (``HEAT_CFNTOOLS``), used
 for most applications-related Heat templates.
 Solution
 ~~~~~~~~
 Add new ``heat-api-cfn`` image, which will be used to create new service and
 configure all necessary endpoints for it in ``fuel-ccp`` tool. Patches to
 ``fuel-ccp``:
 -  `Patch #1 <https://review.openstack.org/#/c/401138/>`__
 -  `Patch #2 <https://review.openstack.org/#/c/401174/>`__
 Heat endpoint not reachable from virtual machines
 -------------------------------------------------
 Symptoms
 ~~~~~~~~
 Heat API and Heat API CFN are not reachable from OpenStack VM, even
 though some of the applications configurations require such access.
 Root cause
 ~~~~~~~~~~
 Fuel CCP deploys Heat services with default configuration and changes
 ``endpoint_type`` from ``publicURL`` to ``internalURL``. However,
 such configuration in Kubernetes cluster is not enough for several types
 of Heat stack resources like ``OS::Heat::Waitcondition`` and 
 ``OS::Heat::SoftwareDeployment``, which require callbacks to Heat API
 or Heat API CFN. Due to Kubernetes  architecture, it's not possible to
 do such callback on the default port value (for ``heat-api`` it's - 8004
 and 8000 for ``heat-api-cfn``).
 Solution
 ~~~~~~~~
 There are two ways to fix described above issues:
 -  Out of the box, which requires just adding some data to .ccp.yaml
   configuration file. This workaround can be used prior OpenStack
   cluster deployment during future OpenStack cluster description.
 -  Second which requires some manual actions and can be processed when
   Openstack is already deployed and cloud administrator can change
   only one component configuration.
 Both of those solutions are described in the `patch to Fuel CCP
 Documentation <https://review.openstack.org/#/c/404114>`__. Please
 notice that additionally `patch to Fuel CCP
 codebase <https://review.openstack.org/#/c/405263/>`__ need to be
 applied to make some of the Kubernetes options configurable.
 Glance+Heat authentication misconfiguration
 -------------------------------------------
 Symptoms
 ~~~~~~~~
 During Heat stacks (representing workloads testing slices) creation,
 random Glance authentication errors were observed.
 Root cause
 ~~~~~~~~~~
 Service-specific users should have admin role in "service" project and
 should not belong to user-facing admin project. Initially Fuel-CCP
 contained Glance and Heat services misconfigured and several patches
 were required to fix it.
 Solution
 ~~~~~~~~
 -  `Patch #1 <https://review.openstack.org/#/c/409033/>`__
 -  `Patch #2 <https://review.openstack.org/#/c/409037/>`__
 Workloads Testing Issues
 ========================
 Random loss of connection to MySQL
 ----------------------------------
 Symptoms
 ~~~~~~~~
 During Heat stacks (representing slices) creation some of the stacks are
 moved to ERROR state with the following traceback being found in the
 Heat logs::
    2016-12-11 16:59:22.220 1165 ERROR nova.api.openstack.extensions
    DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't
    connect to MySQL server on 'galera.ccp.svc.cluster.local' ([Errno
    -2] Name or service not known)")
 Root cause
 ~~~~~~~~~~
 This turned out to be exactly the same issue with KubeDNS being unable
 to handle high loads as described in Kubernetes Issues section above.
 First of all, ``kubedns`` seems to fail often on high load (or even without
 load), during the experiment we observed continuous kubedns container
 restarts even on empty (but big enough) Kubernetes cluster. Restarts
 are caused by liveness check failing, although nothing notable is
 observed in any logs.
 Second, ``dnsmasq`` should have taken load off ``kubedns``, but it needs some
 tuning to behave as expected for big load, otherwise it is useless.
 Solution
 ~~~~~~~~
 See above in Kubernetes Issues section.
 Slow Heat API requests (`Bug 1653088 <https://bugs.launchpad.net/fuel-ccp/+bug/1653088>`__)
 -----------------------------------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 Requests to the Heat API for regular requests took more than a minute of
 time. Example of such request can be presented with `this
 traceback <http://paste.openstack.org/show/593132/>`__ showing time
 needed for listing details of the specific stack
 Root cause
 ~~~~~~~~~~
 Fuel-ccp team proposed that it might be a hidden race condition between
 multiple Heat workers, and it's not yet clear where is this race
 happening. This is still under debug
 Workaround
 ~~~~~~~~~~
 Set up number of Heat workers to 1 until real cause of this issue will
 be found.
 OpenStack VMs cannot fetch required metadata
 --------------------------------------------
 Symptoms
 ~~~~~~~~
 OpenStack servers do not receive metadata with applications-specific
 information. There is a following
 `trace <http://paste.openstack.org/show/593337/>`__ in Heat logs.
 Root cause
 ~~~~~~~~~~
 Prior pushing metadata to OpenStack VMs, Heat is storing information
 about stack under creation to its own database. Starting with OpenStack
 Newton Heat works in so-called “convergence” mode by default, making
 sure that Heat engine process several requests at one time. This
 parallel task processing might end up with race conditions during DB
 access.
 Workaround
 ~~~~~~~~~~
 Turn off Heat convergence engine::
    convergence_engine=False
 RPC timeouts during Heat resources validation
 ---------------------------------------------
 Symptoms
 ~~~~~~~~
 During Heat resources validation, this process is failed with the
 following `trace <http://paste.openstack.org/show/593112/>`__ being
 caught.
 Root cause
 ~~~~~~~~~~
 Initial assumption was that there might be too small
 ``rpc_response_timeout`` parameter being set up in Heat configuration
 file. This parameter was set up to 10 minutes exactly like that was used
 in MOS::
    rpc_response_timeout = 600
 After that was done, no more RPC timeouts were observed.
 Solution
 ~~~~~~~~
 Set up ``rpc_response_timeout = 600`` as that’s tested value for
 generations of MOS releases.
 Overloaded Memcache service (`Bug 1653089 <https://bugs.launchpad.net/fuel-ccp/+bug/1653089>`__)
 ----------------------------------------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 On 200 nodes with workloads deploying OpenStack is actively using cache
 (``memcached``). At some point of time requests to ``memcached`` begin to be
 processed really slow.
 Root cause
 ~~~~~~~~~~
 Memcache size is 256 MB by default in Fuel CCP, which is really small
 size. That was a reason for great amount of retransmissions being
 processed.
 Solution
 ~~~~~~~~
 Increase cache size up to 30G in ``fuel-ccp`` configuration file for huge or
 loaded deployments::
    configs:
        memcached:
            address: 0.0.0.0
            port:
            cont: 11211
            ram: 30720
 Incorrect status information for Neutron agents
 -----------------------------------------------
 Symptoms
 ~~~~~~~~
 While asking for Neutron agents status through OpenStack client, all of
 them are displayed in down state, although due to the environment
 behaviour it does not seem to be true.
 Root cause
 ~~~~~~~~~~
 The root cause of this issue is hidden in OpenStackSDK refactoring that
 caused various OSC networking commands to fail. The full discussion
 regarding this issue can be found in `upstream
 bug <https://bugs.launchpad.net/python-openstackclient/+bug/1652317>`__.
 Workaround
 ~~~~~~~~~~
 Use Neutron client directly to gather Neutron services statuses until
 `bug <https://bugs.launchpad.net/python-openstackclient/+bug/1652317>`__
 will be fixed.
 Nova client not working (`Bug 1653075 <https://bugs.launchpad.net/fuel-ccp/+bug/1653075>`__)
 ------------------------------------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 All commands running through Nova client end up with the `stack
 trace <http://paste.openstack.org/show/593531/>`__.
 Root cause
 ~~~~~~~~~~
 In debug mode it’s seen that client tries to use HTTP protocol for the
 Nova requests, e.g. http://compute.ccp.external:8443/v2.1/, although
 HTTPS protocol is required for this client-server conversation.
 Solution
 ~~~~~~~~
 Add the following lines to
 ``ccp-repos/fuel-ccp-nova/service/files/nova.conf.j2`` file::
    [DEFAULT]
    secure_proxy_ssl_header = HTTP_X_FORWARDED_PROTO
 Neutron server timeouts (`Bug 1653077 <https://bugs.launchpad.net/fuel-ccp/+bug/1653077>`__)
 ------------------------------------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 Neutron server is not able to process requests to it with the following
 error being caught in the Neutron logs:
 `trace <http://paste.openstack.org/show/593483/>`__
 Root cause
 ~~~~~~~~~~
 Default values (that are used in ``fuel-ccp``) for the Neutron database pool
 size, as well as max overflow parameter are not enough for the
 environment with big enough load on Neutron API.
 Solution
 ~~~~~~~~
 Modify ``ccp-repos/fuel-ccp-neutron/service/files/neutron.conf.j2`` file to
 contain the following configuration parameters::
    [database]
    max_pool_size = 30
    max_overflow = 60
 No access to OpenStack VM from tenant network
 ---------------------------------------------
 Symptoms
 ~~~~~~~~
 Some of the VMs representing the slices are not reachable via tenant
 network. For example::
    | 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 |
    ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 |
    ubuntu-software-config-last |
    root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
    Connection closed by 10.144.1.35 port 22
 It is unreachable from tenant network as well. For example from instance
 ``b1946719-b401-447d-8103-cc43b03b1481`` which has been spawned by the same
 Heat stack on the same compute node (``node162``):
 `http://paste.openstack.org/show/593486/ <http://paste.openstack.org/show/593486/>`__
 Root cause and solution
 ~~~~~~~~~~~~~~~~~~~~~~~
 Still under investigation. Root cause not clear yet. **This issue is blocking
 running workloads against deployed slices.**
 OpenStack services don’t handle PXC pseudo-deadlocks
 ----------------------------------------------------
 Symptoms
 ~~~~~~~~
 When run in parallel, create operations of lots of resources were
 failing with DBError saying that Percona Xtradb Cluster identified a
 deadlock and transaction should be restarted.
 Root cause
 ~~~~~~~~~~
 oslo.db is responsible for wrapping errors received from DB into proper
 classes so that services can restart transactions if similar errors
 occur, but it didn’t expect error in format that is being sent by
 Percona. After we fixed this, we still experienced similar errors
 because not all transactions that could be restarted were properly
 decorated in Nova code.
 Upstream issues
 ~~~~~~~~~~~~~~~
 `Bug <https://bugs.launchpad.net/oslo.db/+bug/1648818>`__ has been
 fixed by Roman Podolyaka’s
 `CR <https://review.openstack.org/409194>`__ and
 `backported <https://review.openstack.org/409679>`__ to Newton. It
 fixes Percona deadlock error detection, but there’s at least one place
 in Nova to be fixed (TBD)
 Live migration failed with live_migration_uri configuration
 -------------------------------------------------------------
 Symptoms
 ~~~~~~~~
 With ``live_migration_uri`` configuration, live migrations fails because
 one compute host can’t connect to a libvirt on another host.
 Root cause
 ~~~~~~~~~~
 We can’t specify which IP address to use in template in
 ``live_migration_uri``, so it was trying to use address from first
 interface which happened to be in PXE network while libvirt listens in
 private network. We couldn’t use ``live_migration_inbound_addr`` which
 would solve this problem because of a problem in upstream Nova.
 Upstream issues
 ~~~~~~~~~~~~~~~
 A `bug <https://bugs.launchpad.net/nova/+bug/1638625>`__ in Nova has
 been `fixed <https://review.openstack.org/398956>`__ and
 `backported <https://review.openstack.org/404810>`__ to Newton. We
 `switched <https://review.openstack.org/407708>`__ to using
 ``live_migration_inbound_addr`` after that.
 Contributors
 ============
 The following people have credits for contributing to this
 document:
 * Dina Belova <dbelova@mirantis.com>
 * Yuriy Taraday <ytaraday@mirantis.com>