Testing of major OpenStack services with 1000 compute node containers.

Change-Id: Ifeab8cd5422c92a682ef8867691b53e2fe781edc
Aleksandr Shaposhnikov 2016-06-02 13:36:18 -07:00
parent 856e181f9f
commit 90e73f3c1f
8 changed files with 405 additions and 0 deletions


@ -0,0 +1,204 @@
.. _1000_nodes:
===========================================================
1000 Compute nodes resource consumption/scalability testing
===========================================================
:status: **ready**
:version: 1
:Abstract:
This document describes a test plan for measuring OpenStack services
resource consumption along with scalability potential. It also provides
results that can be used to find bottlenecks and/or potential pain
points when scaling standalone OpenStack services and the OpenStack cloud itself.
Test Plan
=========
Most current OpenStack users wonder how it will behave at scale with a lot
of compute nodes. This is a valid concern because OpenStack has a lot of
services that have different load and resource consumption patterns.
Most cloud operations are related to two things: workload placement and simple
control plane/data plane management for those workloads.
So the main idea of this test plan is to create simple workloads (10-30k
VMs) and observe how the core services handle them and what the resource
consumption is during active workload placement and for some time after that.
Test Environment
----------------
The test assumes that each and every service will be monitored separately for
resource consumption, using known techniques like atop/Nagios/containerization
and any other toolkits/solutions which allow to:

1. Measure CPU/RAM consumption of a process or a set of processes (for
   example, with a per-process sampler like the one sketched after this list).
2. Separate services and give each of them as many of the available resources
   as needed.
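
For illustration only, a per-process sampler could look like the sketch below.
It is not part of the test plan tooling; the psutil dependency and the
hard-coded service name list are assumptions, and any comparable monitoring
agent can be used instead.

.. code-block:: python

    # Minimal sketch of a per-process CPU/RAM sampler. Assumptions: psutil is
    # installed and the service names below match local process command lines.
    import time

    import psutil

    SERVICES = ("nova-api", "nova-scheduler", "nova-conductor", "nova-compute",
                "glance-api", "glance-registry", "neutron-server", "keystone",
                "rabbitmq", "mysqld")


    def service_of(proc):
        """Return the matching service name for a process, or None."""
        cmd = " ".join(proc.info.get("cmdline") or [proc.info.get("name") or ""])
        return next((s for s in SERVICES if s in cmd), None)


    def sample(interval=5):
        """Print one CPU/RSS measurement per matching process every interval."""
        while True:
            for proc in psutil.process_iter(["name", "cmdline"]):
                try:
                    service = service_of(proc)
                    if service is None:
                        continue
                    cpu = proc.cpu_percent(interval=None)   # % since last call
                    rss = proc.memory_info().rss / 2 ** 20  # resident MiB
                    print("%d %s pid=%d cpu=%.1f%% rss=%.1f MiB"
                          % (time.time(), service, proc.pid, cpu, rss))
                except psutil.NoSuchProcess:
                    continue
            time.sleep(interval)


    if __name__ == "__main__":
        sample()
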
List of mandatory services for OpenStack testing:

- nova-api
- nova-scheduler
- nova-conductor
- nova-compute
- glance-api
- glance-registry
- neutron-server
- keystone-all

List of replaceable but still mandatory services:

- neutron-dhcp-agent
- neutron-ovs-agent
- rabbitmq
- libvirtd
- mysqld
- openvswitch-vswitch

List of optional services which may be omitted with a performance decrease:

- memcached

List of optional services which may be omitted:

- horizon
Rally fits here as a pretty stable and reliable load runner. Monitoring can be
done by any suitable software that provides results in a form which allows
building graphs and visualizing resource consumption, so that the data can be
analyzed manually or automatically (one possible form of such post-processing
is sketched below).
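
As an illustration only, the following sketch aggregates raw samples into
per-service top watermarks; it is not part of any existing toolkit, and the
CSV layout ``timestamp,service,cpu_percent,rss_mib`` is an assumption.

.. code-block:: python

    # Sketch: turn raw monitoring samples into per-service top watermarks.
    import csv
    import sys
    from collections import defaultdict


    def watermarks(path):
        """Return {service: {"cpu": peak_cpu_percent, "rss": peak_rss_mib}}."""
        peaks = defaultdict(lambda: {"cpu": 0.0, "rss": 0.0})
        with open(path) as samples:
            for _timestamp, service, cpu, rss in csv.reader(samples):
                peaks[service]["cpu"] = max(peaks[service]["cpu"], float(cpu))
                peaks[service]["rss"] = max(peaks[service]["rss"], float(rss))
        return peaks


    if __name__ == "__main__":
        for service, peak in sorted(watermarks(sys.argv[1]).items()):
            print("%-20s cpu=%8.1f%%  rss=%8.1f MiB"
                  % (service, peak["cpu"], peak["rss"]))
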
Preparation
^^^^^^^^^^^
**Common preparation steps**
Before testing begins, the environment should have all the OpenStack services
up and running. Of course, they should be configured according to the settings
recommended for the release and/or for your specific environment or use case.
To obtain real-world RPS/TPS/etc. metrics, all the services (including compute
nodes) should be on separate physical servers, but again this depends on the
setup and requirements. For simplicity, and when only the control plane is
being tested, the fake compute driver can be used (e.g. by setting
``compute_driver = fake.FakeDriver`` in ``nova.conf``).
Environment description
^^^^^^^^^^^^^^^^^^^^^^^
The environment description includes hardware specification of servers,
network parameters, operation system and OpenStack deployment characteristics.
Hardware
~~~~~~~~
This section contains the list of all types of hardware nodes.
+-----------+-------+----------------------------------------------------+
| Parameter | Value | Comments |
+-----------+-------+----------------------------------------------------+
| model | | e.g. Supermicro X9SRD-F |
+-----------+-------+----------------------------------------------------+
| CPU | | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
+-----------+-------+----------------------------------------------------+
Network
~~~~~~~
This section contains the list of interfaces and network parameters.
For complicated cases this section may include a topology diagram and switch
parameters.
+------------------+-------+-------------------------+
| Parameter | Value | Comments |
+------------------+-------+-------------------------+
| card model | | e.g. Intel |
+------------------+-------+-------------------------+
| driver | | e.g. ixgbe |
+------------------+-------+-------------------------+
| speed | | e.g. 10G or 1G |
+------------------+-------+-------------------------+
Software
~~~~~~~~
This section describes installed software.
+-------------------+--------+---------------------------+
| Parameter | Value | Comments |
+-------------------+--------+---------------------------+
| OS | | e.g. Ubuntu 14.04.3 |
+-------------------+--------+---------------------------+
| DB | | e.g. MySQL 5.6 |
+-------------------+--------+---------------------------+
| MQ broker | | e.g. RabbitMQ v3.4.25 |
+-------------------+--------+---------------------------+
| OpenStack release | | e.g. Liberty |
+-------------------+--------+---------------------------+
Configuration
~~~~~~~~~~~~~
This section describes the configuration of OpenStack and the core services.
+-------------------+-------------------------------+
| Parameter | File |
+-------------------+-------------------------------+
| Keystone | ./results/keystone.conf |
+-------------------+-------------------------------+
| Nova-api | ./results/nova-api.conf |
+-------------------+-------------------------------+
| ...               | ...                           |
+-------------------+-------------------------------+
Test Case 1: Resources consumption under severe load
----------------------------------------------------
Description
^^^^^^^^^^^
This test should spawn a number of instances in n parallel threads and, along
with that, record CPU/RAM metrics for all the OpenStack and core services
such as the MQ broker and the DB server. As the test itself is pretty long,
there is no need for a very high sampling resolution; one measurement every
5 seconds should be more than enough.
A Rally scenario that creates a load of 50 parallel threads spawning VMs and
listing them can be found in the test plan folder and can be used for testing
purposes. It can be modified to fit specific deployment needs (for example,
programmatically, as in the sketch below) and then started with
``rally task start <task file>``.
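
A minimal sketch of such a modification is shown below; the file names
``boot-and-list.json``/``boot-and-list-tuned.json`` and the new values are
placeholders, not recommendations.

.. code-block:: python

    # Sketch: bump the load in the Rally task shipped with this plan before
    # running it. File names and the new values below are placeholders only.
    import json

    with open("boot-and-list.json") as task_file:
        task = json.load(task_file)

    runner = task["NovaServers.boot_and_list_server"][0]["runner"]
    runner["concurrency"] = 100   # parallel workers (50 in the shipped task)
    runner["times"] = 30000       # total iterations (20000 in the shipped task)

    with open("boot-and-list-tuned.json", "w") as task_file:
        json.dump(task, task_file, indent=4)
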
Parameters
^^^^^^^^^^
============================ ====================================================
Parameter name Value
============================ ====================================================
OpenStack release Liberty, Mitaka
Compute nodes amount 50,100,200,500,1000,2000,5000,10000
Services configurations Configuration for each OpenStack and core service
============================ ====================================================
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The test case result is presented as a weighted tree structure with operations
as nodes and the time spent on them as node weights, for every control plane
operation under test (a minimal illustration of such a tree is sketched
below). This information is automatically gathered by Ceilometer and can be
transformed into a human-friendly report via OSprofiler.
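
Purely as an illustration of that structure (this sketch does not use the
OSprofiler or Ceilometer APIs, and the operation names and durations are made
up):

.. code-block:: python

    # Sketch of the weighted tree described above: operations are nodes and
    # the time spent in each operation is the node weight.
    class OperationNode(object):

        def __init__(self, name, duration_ms):
            self.name = name                # operation, e.g. "boot_server"
            self.duration_ms = duration_ms  # node weight
            self.children = []

        def add(self, child):
            self.children.append(child)
            return child

        def total_ms(self):
            """Weight of this node plus all of its descendants."""
            return self.duration_ms + sum(c.total_ms() for c in self.children)


    boot = OperationNode("boot_server", 150.0)
    boot.add(OperationNode("schedule_instance", 40.0))
    boot.add(OperationNode("plug_vif", 25.0))
    print(boot.total_ms())  # 215.0
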
======== ================ ================= ==========================================
Priority Value            Measurement Units Description
======== ================ ================= ==========================================
1        CPU load         MHz               CPU load for each OpenStack service
2        RAM consumption  GB                RAM consumption for each OpenStack service
3        Instances amount Amount            Max number of instances spawned
4        Operation time   milliseconds      Time spent on every instance spawn
======== ================ ================= ==========================================


@ -0,0 +1,58 @@
{
"NovaServers.boot_and_list_server": [
{
"runner": {
"type": "constant",
"concurrency": 50,
"times": 20000
},
"args": {
"detailed": true,
"flavor": {
"name": "m1.tiny"
},
"image": {
"name": "cirros"
}
},
"sla": {
"failure_rate": {
"max": 0
}
},
"context": {
"users": {
"project_domain": "default",
"users_per_tenant": 2,
"tenants": 200,
"resource_management_workers": 30,
"user_domain": "default"
},
"quotas": {
"nova": {
"ram": -1,
"floating_ips": -1,
"security_group_rules": -1,
"instances": -1,
"cores": -1,
"security_groups": -1
},
"neutron": {
"subnet": -1,
"network": -1,
"port": -1
}
},
"network": {
"network_create_args": {
"tenant_id": "d51f243eba4d48d09a853e23aeb68774",
"name": "c_rally_b7d5d2f5_OqPRUMD8"
},
"subnets_per_network": 1,
"start_cidr": "1.0.0.0/21",
"networks_per_tenant": 1
}
}
}
]
}


@ -19,4 +19,5 @@ Test Plans
container_cluster_systems/plan
neutron_features/l3_ha/test_plan
hardware_features/index
1000_nodes/plan


@ -0,0 +1,141 @@
Testing at the scale of 1000 compute hosts
==========================================
Environment setup
-----------------
Each and every OpenStack service was placed into a container. The containers
were spread mostly across 17 nodes. “Mostly” means that some of them were
placed on separate nodes so that they could get more resources if needed,
without being limited by other services. After some initial assumptions, this
privilege was given to the following containers: rabbitmq, mysql, keystone,
nova-api and neutron. Later, after some observations, only rabbitmq, mysql and
keystone kept this privilege. All other containers were placed with somewhat
higher priorities but without dedicating additional hosts to them.
List of OpenStack and core services which were used in the testing environment
(the number in parentheses is the number of instances/containers):
- nova-api(1)
- nova-scheduler(8)
- nova-conductor(1)
- nova-compute(1000)
- glance-api(1)
- glance-registry(1)
- neutron-server(1)
- neutron-dhcp-agent(1)
- neutron-ovs-agent(1000)
- keystone-all(1)
- rabbitmq(1)
- mysqld(1)
- memcached(1)
- horizon(1)
- libvirtd(1000)
- openvswitch-vswitch(1000)
Additional information
----------------------
We ran 8 instances of nova-scheduler because it is known not to scale within a
single service process (there are no workers/threads/etc. inside
nova-scheduler). All other OpenStack services were run as a single instance.
Each and every “Compute node” container has neutron-ovs-agent, libvirtd,
nova-compute and openvswitch-vswitch inside, so we have 1000 “Compute”
containers spread across ~13 nodes.
The prime aim of this simple testing is to check the scalability of the
OpenStack control and data plane services. Because of that, RabbitMQ and MySQL
were run in single-node mode, just to verify the “essential load” and to
confirm that there are no issues with standalone nodes. Later we will run
tests with a Galera cluster and clustered RabbitMQ.
We used the official Mirantis MOS 8.0 (OpenStack Liberty release) repository
for creating containers with OpenStack services.
A set of tests was run with the fake compute driver for preliminary checks and
for overall load and placement verification. Later we modified the original
libvirt driver to only skip the actual VM booting (the spawn of the qemu-kvm
process). All other things related to instance spawning are actually done.
Glance was used with local file storage as a backend. CirrOS images (~13 MB)
were used for VM booting. Local disks of the nodes/containers were used as
storage for the VMs.
Methodology
-----------
For simplicity we chose the “boot and list VM” scenario in Rally with the
following important parameters:
- Total number of instances: 20000
- Total number of workers: 50
- Total number of networks: 200
- Total number of tenants: 200
- Total number of users: 400
Even in 2-3 years, the probability of 1000 compute hosts being added at the
same moment (all of them within 10-15 seconds) is close to 0%; therefore it is
necessary to start all Compute containers first and wait for ~5-10 minutes to
give Neutron DVR enough time to update all the agents so that they know about
each other.
After that we start the Rally test scenario. Because of the nature of the
changes in the nova-compute driver, the start of a VM is considered successful
before security groups get applied to it (similar to setting
vif_plugging_is_fatal=False). This leads to increased Neutron server load and
to the possibility that not all the rules are applied by the end of the
testing. On the other hand, in our case it creates a bigger load on Neutron,
which makes this test much heavier. We plan to repeat this test later with
this particular behavior excluded and compare the results.
In the folder with this report you'll find additional files with the test
scenario, results and usage pattern observations.
Here we would just like to point out some findings about the resource
consumption of each and every service, which could help with server capacity
planning. All servers had 2x Intel Xeon E5-2680 v3 CPUs.
Here are the top watermarks of different services under the mentioned test load.
Table 1. Services top watermarks
+-----------------+---------+----------+
| Service         | CPU     | RAM      |
+=================+=========+==========+
| nova-api        | 13 GHz  | 12.4 GB  |
+-----------------+---------+----------+
| nova-scheduler* | 1 GHz   | 1.1 GB   |
+-----------------+---------+----------+
| nova-conductor  | 30 GHz  | 4.8 GB   |
+-----------------+---------+----------+
| glance-api      | 160 MHz | 1.8 GB   |
+-----------------+---------+----------+
| glance-registry | 300 MHz | 1.8 GB   |
+-----------------+---------+----------+
| neutron-server  | 30 GHz  | 20 GB    |
+-----------------+---------+----------+
| keystone-all    | 14 GHz  | 2.7 GB   |
+-----------------+---------+----------+
| rabbitmq        | 21 GHz  | 17 GB    |
+-----------------+---------+----------+
| mysqld          | 1.9 GHz | 3.5 GB   |
+-----------------+---------+----------+
| memcached       | 10 MHz  | 27 MB    |
+-----------------+---------+----------+

\* Per each of the eight nova-scheduler processes.
The very first assumption at the scale of 1000 nodes is the following: it
would be good to have 2 dedicated servers per component. Here is the list of
components that would require that: nova-conductor, nova-api, neutron-server
and keystone. The RabbitMQ and MySQL servers worked in standalone mode, so
clustering overhead will be added and they will consume considerably more
resources than we have already measured (a rough back-of-the-envelope check
of these figures against the hardware used is sketched below).
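
The sketch below is such a back-of-the-envelope check and is an addition to
the report, not part of the original measurements. It assumes a nominal
capacity of 2 sockets x 12 cores x 2.5 GHz (the E5-2680 v3 base clock) per
server and ignores turbo, hyper-threading and clustering overhead; the peak
values are taken from Table 1.

.. code-block:: python

    # Back-of-the-envelope check of the "2 dedicated servers per component"
    # suggestion against the nominal capacity of one dual E5-2680 v3 server.
    NOMINAL_GHZ_PER_SERVER = 2 * 12 * 2.5   # = 60 GHz (assumption, base clock)

    peaks_ghz = {                            # top watermarks from Table 1
        "nova-api": 13.0,
        "nova-conductor": 30.0,
        "neutron-server": 30.0,
        "keystone-all": 14.0,
        "rabbitmq": 21.0,
    }

    for service, peak in sorted(peaks_ghz.items()):
        share = 100.0 * peak / NOMINAL_GHZ_PER_SERVER
        print("%-15s peak %5.1f GHz = %3.0f%% of one server's nominal capacity"
              % (service, peak, share))
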
Graphs:
.. image:: stats1.png
:width: 1300px
.. image:: stats2.png
:width: 1300px

Binary file not shown (added; 1.3 MiB)

Binary file not shown (added; 1.4 MiB)


@ -19,3 +19,4 @@ Test Results
neutron_features/index
hardware_features/index
provisioning/index
1000_nodes/index

Binary file not shown.