Remove workloads testing data, that became private

Change-Id: Ib0dad00b1261b29f4ce8620f500f9eefc3943c6a
Dina Belova 2017-02-01 14:18:48 -08:00
parent 65ab1dcb96
commit 8726882fc0


@ -21,22 +21,6 @@ Glossary
administrators control while empowering their users to provision
resources through a web interface.
- **Heat** is an OpenStack service to orchestrate composite cloud
applications using a declarative template format through an
OpenStack-native REST API.
- **Slice** is a set of 6 VMs:
- 1x Yahoo! Benchmark (ycsb)
- 1x Cassandra
- 1x Magento
- 1x Wordpress
- 2x Idle VM
Setup
=====
@ -374,502 +358,6 @@ adjusted to use route reflectors set up on selected nodes or on
switching fabric hardware. This will reduce the number of BGP
connections per node and speed up the Calico startup.
===========================================
OpenStack Issues At Scale 200 Compute Nodes
===========================================
Workloads testing approach
==========================
The goal of this test was to investigate OpenStack and Kubernetes
behavior under load. The load needs to emulate end-user traffic running
on OpenStack servers (guest systems), with different types of
applications running against the cloud system. We were interested in the
following metrics: CPU usage, disk statistics, network load, memory usage
and I/O statistics on each controller node and on a chosen set of compute
nodes.
Testing preparation steps
-------------------------
To prepare for testing, several steps are required:
- [0] Pre-deploy Kubernetes + OpenStack Newton environment (Kargo +
fuel-ccp tools)
- [1] Establish advanced monitoring of the testing environment (on
all three layers - system, Kubernetes and OpenStack)
- [2] Prepare to automatically deploy slices against OpenStack and
configure applications running within them
- [3] Prepare to run workloads against applications running within
slices
[1] was achieved by configuring the Prometheus monitoring and alerting
tool with Collectd and Telegraf collectors. A separate monitoring
document is being prepared to present the significant effort spent on
this task.
[2] was automated using the Heat OpenStack orchestration service
`templates <https://github.com/ayasakov/hot-ansible-templates>`__.
[3] was automated mostly by generating HTTP traffic native to the
applications under test, using Locust.IO tool
`templates <https://github.com/ayasakov/locustio-workloads>`__.
The Cassandra VM workload was automated through Yahoo! Benchmark (ycsb)
running on a neighbouring VM of the same slice.
Step [1] was tested against the 900-node Kubernetes cluster described in
the section above and later against all test environments we had. Steps [2] and
[3] were verified against a small (20-node) testing environment in
parallel with finalizing step [1]. During this verification,
several issues with the recently introduced Heat support in fuel-ccp were
observed and fixed. All of those steps were then supposed to be run
against the 200 bare-metal node lab, but a number of issues observed
during step [2] blocked completion of the testing. All of those issues
(including those found during small environment verification) are listed
below.
Heat/fuel-ccp integration issues
================================
Lack of Heat domain configuration
---------------------------------
Symptoms
~~~~~~~~
Authentication errors during Heat stacks (representing workloads testing
slices on each compute node) creation.
Root cause
~~~~~~~~~~
During the OpenStack Newton timeframe, Orchestration started to require
additional information in the Identity service to manage stacks. In
particular, it needs a heat domain that contains stack
projects and users, a ``heat_domain_admin`` user to manage
projects and users in the heat domain, and the admin role granted to the
``heat_domain_admin`` user in the heat domain to enable administrative
stack management privileges for that user. Please take
a look at the `OpenStack Newton configuration
guide <http://docs.openstack.org/project-install-guide/orchestration/newton/install-ubuntu.html>`__
for more information.
Solution
~~~~~~~~
Set up needed configuration steps in ``fuel-ccp``:
- `Patch #1 <https://review.openstack.org/#/c/400846/>`__
- `Patch #2 <https://review.openstack.org/#/c/402045/>`__
Lack of heat-api-cfn service configuration in fuel-ccp
------------------------------------------------------
Symptoms
~~~~~~~~
Applications configured in Heat templates do not receive the
configuration and data they need to work correctly.
Root cause
~~~~~~~~~~
The initial Heat support in ``fuel-ccp`` did not include the ``heat-api-cfn``
service, which Heat uses for some config transports. This service is
necessary to support the default ``user_data_format`` (``HEAT_CFNTOOLS``), used
by most application-related Heat templates.
Solution
~~~~~~~~
Add a new ``heat-api-cfn`` image, which is used to create the new service, and
configure all necessary endpoints for it in the ``fuel-ccp`` tool. Patches to
``fuel-ccp``:
- `Patch #1 <https://review.openstack.org/#/c/401138/>`__
- `Patch #2 <https://review.openstack.org/#/c/401174/>`__
Heat endpoint not reachable from virtual machines
-------------------------------------------------
Symptoms
~~~~~~~~
Heat API and Heat API CFN are not reachable from OpenStack VMs, even
though some application configurations require such access.
Root cause
~~~~~~~~~~
Fuel CCP deploys Heat services with the default configuration and changes
``endpoint_type`` from ``publicURL`` to ``internalURL``. However,
in a Kubernetes cluster this configuration is not enough for several types
of Heat stack resources, such as ``OS::Heat::WaitCondition`` and
``OS::Heat::SoftwareDeployment``, which require callbacks to Heat API
or Heat API CFN. Due to the Kubernetes architecture, it is not possible to
make such callbacks on the default ports (8004 for ``heat-api``
and 8000 for ``heat-api-cfn``).
Solution
~~~~~~~~
There are two ways to fix the issues described above:
- An out-of-the-box fix, which only requires adding some data to the .ccp.yaml
configuration file. This workaround can be applied before OpenStack
cluster deployment, while describing the future OpenStack cluster.
- A second fix, which requires some manual actions and can be applied when
OpenStack is already deployed and the cloud administrator can change
only one component's configuration.
Both of those solutions are described in the `patch to Fuel CCP
Documentation <https://review.openstack.org/#/c/404114>`__. Please
note that a `patch to Fuel CCP
codebase <https://review.openstack.org/#/c/405263/>`__ also needs to be
applied to make some of the Kubernetes options configurable.
Glance+Heat authentication misconfiguration
-------------------------------------------
Symptoms
~~~~~~~~
During Heat stacks (representing workloads testing slices) creation,
random Glance authentication errors were observed.
Root cause
~~~~~~~~~~
Service-specific users should have the admin role in the "service" project and
should not belong to the user-facing admin project. Initially, Fuel-CCP
had the Glance and Heat services misconfigured, and several patches
were required to fix this.
Solution
~~~~~~~~
- `Patch #1 <https://review.openstack.org/#/c/409033/>`__
- `Patch #2 <https://review.openstack.org/#/c/409037/>`__
Workloads Testing Issues
========================
Random loss of connection to MySQL
----------------------------------
Symptoms
~~~~~~~~
During creation of Heat stacks (representing slices), some of the stacks
move to the ERROR state, with the following traceback found in the
Heat logs::
2016-12-11 16:59:22.220 1165 ERROR nova.api.openstack.extensions
DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't
connect to MySQL server on 'galera.ccp.svc.cluster.local' ([Errno
-2] Name or service not known)")
Root cause
~~~~~~~~~~
This turned out to be exactly the same issue with KubeDNS being unable
to handle high loads as described in the Kubernetes Issues section above.
First, ``kubedns`` seems to fail often under high load (or even without
load). During the experiment we observed continuous ``kubedns`` container
restarts even on an empty (but big enough) Kubernetes cluster. The restarts
are caused by a failing liveness check, although nothing notable is
observed in any logs.
Second, ``dnsmasq`` should have taken load off ``kubedns``, but it needs some
tuning to behave as expected under heavy load; otherwise it is useless.
Solution
~~~~~~~~
See the Kubernetes Issues section above.
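To tell whether name resolution is the failing step, the lookup from the traceback can be repeated in a loop and the failures counted. The following is a minimal sketch only: ``probe_dns`` and its parameters are illustrative names that do not come from fuel-ccp, and the resolver is injectable so the function can be exercised outside a cluster.

```python
import socket
import time


def probe_dns(hostname, attempts=5, delay=0.0, resolver=socket.getaddrinfo):
    """Resolve `hostname` `attempts` times and return the number of failures.

    `resolver` defaults to socket.getaddrinfo; a stub can be passed instead
    to try the probe without access to the cluster DNS.
    """
    failures = 0
    for _ in range(attempts):
        try:
            resolver(hostname, None)
        except socket.gaierror:
            # "[Errno -2] Name or service not known" is reported as gaierror.
            failures += 1
        time.sleep(delay)
    return failures
```

Run from inside a pod, a call such as ``probe_dns("galera.ccp.svc.cluster.local", attempts=20, delay=0.5)`` would be expected to return a non-zero count while ``kubedns`` is restarting.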
Slow Heat API requests (`Bug 1653088 <https://bugs.launchpad.net/fuel-ccp/+bug/1653088>`__)
-----------------------------------------------------------------------------------------
Symptoms
~~~~~~~~
Regular requests to the Heat API took more than a minute.
An example of such a request is shown in `this
traceback <http://paste.openstack.org/show/593132/>`__, which shows the time
needed to list the details of a specific stack.
Root cause
~~~~~~~~~~
The fuel-ccp team suggested that it might be a hidden race condition between
multiple Heat workers, but it is not yet clear where this race
happens. This is still being debugged.
Workaround
~~~~~~~~~~
Set the number of Heat workers to 1 until the real cause of this issue
is found.
OpenStack VMs cannot fetch required metadata
--------------------------------------------
Symptoms
~~~~~~~~
OpenStack servers do not receive metadata with application-specific
information. The following
`trace <http://paste.openstack.org/show/593337/>`__ appears in the Heat logs.
Root cause
~~~~~~~~~~
Before pushing metadata to OpenStack VMs, Heat stores information
about the stack being created in its own database. Starting with OpenStack
Newton, Heat works in so-called “convergence” mode by default, in which
the Heat engine processes several requests at a time. This
parallel task processing might lead to race conditions during DB
access.
Workaround
~~~~~~~~~~
Turn off Heat convergence engine::
convergence_engine=False
RPC timeouts during Heat resources validation
---------------------------------------------
Symptoms
~~~~~~~~
Heat resources validation fails with the
following `trace <http://paste.openstack.org/show/593112/>`__ being
caught.
Root cause
~~~~~~~~~~
The initial assumption was that the
``rpc_response_timeout`` parameter in the Heat configuration
file was set too low. This parameter was set to 10 minutes, the same value that was used
in MOS::
rpc_response_timeout = 600
After that was done, no more RPC timeouts were observed.
Solution
~~~~~~~~
Set ``rpc_response_timeout = 600``, as that value has been tested over
generations of MOS releases.
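The two Heat settings recommended in this document (``rpc_response_timeout = 600`` and ``convergence_engine=False``) can be checked against a rendered configuration file. The sketch below assumes both options live in the ``[DEFAULT]`` section and that the file can be parsed by Python's ``configparser``; ``missing_overrides`` is an illustrative helper and not part of fuel-ccp.

```python
import configparser

# Values taken from the workarounds described in this document.
RECOMMENDED = {
    "rpc_response_timeout": "600",
    "convergence_engine": "False",
}


def missing_overrides(conf_text, section="DEFAULT"):
    """Return recommended options that are absent or set to another value.

    The result maps each offending option to the value found (None if unset).
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    problems = {}
    for option, wanted in RECOMMENDED.items():
        actual = parser.get(section, option, fallback=None)
        if actual is None or actual.strip().lower() != wanted.lower():
            problems[option] = actual
    return problems
```

An empty result means that both overrides are present; any other result names the options that still carry a different value.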
Overloaded Memcache service (`Bug 1653089 <https://bugs.launchpad.net/fuel-ccp/+bug/1653089>`__)
----------------------------------------------------------------------------------------------
Symptoms
~~~~~~~~
On 200 nodes with workloads being deployed, OpenStack actively uses the cache
(``memcached``). At some point, requests to ``memcached`` start being
processed really slowly.
Root cause
~~~~~~~~~~
The Memcache size is 256 MB by default in Fuel CCP, which is really
small. This caused a large number of retransmissions.
Solution
~~~~~~~~
Increase the cache size to 30 GB in the ``fuel-ccp`` configuration file for huge or
heavily loaded deployments::
    configs:
      memcached:
        address: 0.0.0.0
        port:
          cont: 11211
        ram: 30720
Incorrect status information for Neutron agents
-----------------------------------------------
Symptoms
~~~~~~~~
When the status of Neutron agents is requested through the OpenStack client, all of
them are displayed in the down state, although the environment
behaviour suggests this is not true.
Root cause
~~~~~~~~~~
The root cause of this issue is hidden in OpenStackSDK refactoring that
caused various OSC networking commands to fail. The full discussion
regarding this issue can be found in `upstream
bug <https://bugs.launchpad.net/python-openstackclient/+bug/1652317>`__.
Workaround
~~~~~~~~~~
Use the Neutron client directly to gather Neutron service statuses until the
`bug <https://bugs.launchpad.net/python-openstackclient/+bug/1652317>`__
is fixed.
Nova client not working (`Bug 1653075 <https://bugs.launchpad.net/fuel-ccp/+bug/1653075>`__)
------------------------------------------------------------------------------------------
Symptoms
~~~~~~~~
All commands run through the Nova client end with this `stack
trace <http://paste.openstack.org/show/593531/>`__.
Root cause
~~~~~~~~~~
In debug mode it can be seen that the client tries to use the HTTP protocol for
Nova requests, e.g. http://compute.ccp.external:8443/v2.1/, although
the HTTPS protocol is required for this client-server conversation.
Solution
~~~~~~~~
Add the following lines to
``ccp-repos/fuel-ccp-nova/service/files/nova.conf.j2`` file::
[DEFAULT]
secure_proxy_ssl_header = HTTP_X_FORWARDED_PROTO
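The effect of this option can be illustrated with a simplified model of a WSGI service behind a TLS-terminating proxy. This is a sketch under assumptions and not Nova's actual code: it assumes the proxy sets the ``X-Forwarded-Proto`` header, and ``effective_scheme`` and ``versioned_url`` are illustrative names.

```python
def effective_scheme(environ, secure_proxy_ssl_header=None):
    """Pick the URL scheme a WSGI service would advertise in its links.

    `environ` is a WSGI environment dict. When `secure_proxy_ssl_header`
    names an environ key and the proxy has set it, that value replaces the
    scheme of the proxy-to-service hop.
    """
    scheme = environ.get("wsgi.url_scheme", "http")
    if secure_proxy_ssl_header:
        forwarded = environ.get(secure_proxy_ssl_header)
        if forwarded:
            scheme = forwarded
    return scheme


def versioned_url(environ, secure_proxy_ssl_header=None):
    """Build the kind of versioned link the service hands back to clients."""
    scheme = effective_scheme(environ, secure_proxy_ssl_header)
    return f"{scheme}://{environ['HTTP_HOST']}/v2.1/"
```

Without the option, the service only sees the plain HTTP hop from the proxy and advertises ``http://`` links, which matches the symptom described above. With the option set, the forwarded scheme is used instead.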
Neutron server timeouts (`Bug 1653077 <https://bugs.launchpad.net/fuel-ccp/+bug/1653077>`__)
------------------------------------------------------------------------------------------
Symptoms
~~~~~~~~
Neutron server is not able to process requests to it with the following
error being caught in the Neutron logs:
`trace <http://paste.openstack.org/show/593483/>`__
Root cause
~~~~~~~~~~
The default values (used in ``fuel-ccp``) for the Neutron database pool
size and the max overflow parameter are not sufficient for an
environment with a heavy load on the Neutron API.
Solution
~~~~~~~~
Modify ``ccp-repos/fuel-ccp-neutron/service/files/neutron.conf.j2`` file to
contain the following configuration parameters::
[database]
max_pool_size = 30
max_overflow = 60
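When raising these values, it helps to estimate how many connections the database may have to accept. A rough sketch, assuming each Neutron server process keeps its own connection pool (the process count used in the usage note is hypothetical):

```python
def max_db_connections(max_pool_size, max_overflow, processes=1):
    """Upper bound on DB connections opened by `processes` worker processes.

    Each process may hold up to `max_pool_size` pooled connections plus up
    to `max_overflow` temporary ones.
    """
    return processes * (max_pool_size + max_overflow)
```

With the values above, one process can open up to ``max_db_connections(30, 60)`` = 90 connections, and a hypothetical deployment with 8 such processes up to 720. The database connection limit has to leave room for that.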
No access to OpenStack VM from tenant network
---------------------------------------------
Symptoms
~~~~~~~~
Some of the VMs representing the slices are not reachable via tenant
network. For example::
| 93b95c73-f849-4ffb-9108-63cf262d3a9f | cassandra_vm_0 |
ACTIVE | slice0-node162-net=11.62.0.8, 10.144.1.35 |
ubuntu-software-config-last |
root@node1:~# ssh -i .ssh/slace ubuntu@10.144.1.35
Connection closed by 10.144.1.35 port 22
It is unreachable from the tenant network as well, for example from instance
``b1946719-b401-447d-8103-cc43b03b1481``, which was spawned by the same
Heat stack on the same compute node (``node162``):
`http://paste.openstack.org/show/593486/ <http://paste.openstack.org/show/593486/>`__
Root cause and solution
~~~~~~~~~~~~~~~~~~~~~~~
Still under investigation; the root cause is not clear yet. **This issue is blocking
running workloads against deployed slices.**
OpenStack services don't handle PXC pseudo-deadlocks
----------------------------------------------------
Symptoms
~~~~~~~~
When run in parallel, create operations for many resources
failed with a DBError saying that Percona XtraDB Cluster had identified a
deadlock and the transaction should be restarted.
Root cause
~~~~~~~~~~
oslo.db is responsible for wrapping errors received from the DB into proper
classes so that services can restart transactions if such errors
occur, but it didn't expect the error format that is sent by
Percona. After we fixed this, we still experienced similar errors
because not all transactions that could be restarted were properly
decorated in Nova code.
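The "decorated" transactions mentioned above follow a retry-on-deadlock pattern. The sketch below shows the general idea only. It is not the oslo.db or Nova implementation, and ``DeadlockError`` and ``retry_on_deadlock`` are hypothetical stand-ins for the real exception class and decorator.

```python
import functools
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the deadlock error raised by the DB layer."""


def retry_on_deadlock(max_retries=5, delay=0.0):
    """Re-run the wrapped function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the error
                    time.sleep(delay * (attempt + 1))  # linear back-off
        return wrapper
    return decorator
```

A retry like this only works if the DB layer maps the backend's deadlock message to the expected exception class, and only for functions that carry the decorator. Those are the two gaps described in this section.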
Upstream issues
~~~~~~~~~~~~~~~
The `bug <https://bugs.launchpad.net/oslo.db/+bug/1648818>`__ has been
fixed by Roman Podolyaka's
`CR <https://review.openstack.org/409194>`__ and
`backported <https://review.openstack.org/409679>`__ to Newton. It
fixes Percona deadlock error detection, but there's at least one place
in Nova still to be fixed (TBD).
Live migration failed with live_migration_uri configuration
-------------------------------------------------------------
Symptoms
~~~~~~~~
With the ``live_migration_uri`` configuration, live migrations fail because
one compute host can't connect to libvirt on another host.
Root cause
~~~~~~~~~~
We can't specify which IP address to use in the
``live_migration_uri`` template, so it tried to use the address of the first
interface, which happened to be in the PXE network, while libvirt listens on the
private network. We couldn't use ``live_migration_inbound_addr``, which
would have solved this problem, because of a bug in upstream Nova.
Upstream issues
~~~~~~~~~~~~~~~
A `bug <https://bugs.launchpad.net/nova/+bug/1638625>`__ in Nova has
been `fixed <https://review.openstack.org/398956>`__ and
`backported <https://review.openstack.org/404810>`__ to Newton. We
`switched <https://review.openstack.org/407708>`__ to using
``live_migration_inbound_addr`` after that.
Contributors
============