Testing of major OpenStack services with 1000 compute node containers.

Change-Id: Ifeab8cd5422c92a682ef8867691b53e2fe781edc
This commit is contained in:
parent
856e181f9f
commit
90e73f3c1f
204
doc/source/test_plans/1000_nodes/plan.rst
Normal file
@@ -0,0 +1,204 @@
.. _1000_nodes:

============================================================
1000 Compute nodes resource consumption/scalability testing
============================================================

:status: **ready**
:version: 1

:Abstract:

  This document describes a test plan for measuring the resource consumption
  of OpenStack services along with their scalability potential. It also
  provides results which can be used to find bottlenecks and/or potential
  pain points for scaling standalone OpenStack services and the OpenStack
  cloud itself.

Test Plan
=========

Most current OpenStack users wonder how OpenStack will behave at scale, with a
large number of compute nodes. This is a valid concern, because OpenStack
consists of many services which have different load and resource consumption
patterns. Most cloud operations come down to two things: placing workloads and
performing simple control plane/data plane management for them. So the main
idea of this test plan is to create simple workloads (10-30k VMs) and observe
how the core services handle them and what their resource consumption is
during active workload placement and for some time after that.

Test Environment
----------------

The test assumes that each and every service will be monitored separately for
resource consumption, using known techniques such as
atop/nagios/containerization or any other toolkit/solution which allows to:

1. Measure the CPU/RAM consumption of a process or a set of processes
   (a sampling sketch follows this list).
2. Separate services and provide them with as many of the available
   resources as possible to fulfill their needs.
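
For illustration, per-service CPU/RAM samples at the 5-second resolution used
later in this plan can be collected with ``psutil``. The process-name matching
and the CSV layout below are simplifying assumptions, not requirements of this
plan:

.. code-block:: python

   import time
   import psutil

   # Hypothetical name patterns; adjust to the real deployment (containers may
   # rename binaries, RabbitMQ shows up as "beam.smp", etc.).
   SERVICES = ["nova-api", "nova-conductor", "nova-scheduler",
               "neutron-server", "keystone", "mysqld"]
   INTERVAL = 5  # seconds, the sampling resolution suggested by this plan

   def sample():
       """Return {service: [cpu_percent, rss_bytes]} summed over matching processes."""
       totals = dict((name, [0.0, 0]) for name in SERVICES)
       for proc in psutil.process_iter(attrs=["name", "cmdline"]):
           try:
               ident = " ".join(proc.info["cmdline"] or [proc.info["name"] or ""])
               for name in SERVICES:
                   if name in ident:
                       totals[name][0] += proc.cpu_percent(interval=None)
                       totals[name][1] += proc.memory_info().rss
           except (psutil.NoSuchProcess, psutil.AccessDenied):
               continue
       return totals

   while True:
       now = int(time.time())
       for name, (cpu, rss) in sample().items():
           print("%d,%s,%.1f,%d" % (now, name, cpu, rss))  # CSV: time,service,cpu%,rss
       time.sleep(INTERVAL)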

List of mandatory services for OpenStack testing:

- nova-api
- nova-scheduler
- nova-conductor
- nova-compute
- glance-api
- glance-registry
- neutron-server
- keystone-all

List of replaceable but still mandatory services:

- neutron-dhcp-agent
- neutron-ovs-agent
- rabbitmq
- libvirtd
- mysqld
- openvswitch-vswitch

List of optional services which may be omitted at the cost of a performance
decrease:

- memcached

List of optional services which may be omitted:

- horizon

Rally fits here as a pretty stable and reliable load runner. Monitoring can be
done by any suitable software that is able to provide results in a form which
allows building graphs and visualizing resource consumption, so that the data
can be analyzed manually or automatically (a plotting sketch based on the
sampler above follows).
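
One possible way to turn the samples into graphs, assuming the sampler output
above was redirected to a ``usage.csv`` file (both the file name and the column
layout are assumptions of that sketch), is pandas plus matplotlib:

.. code-block:: python

   import pandas as pd
   import matplotlib.pyplot as plt

   # Columns follow the hypothetical CSV layout of the sampler sketch above.
   df = pd.read_csv("usage.csv", names=["time", "service", "cpu_percent", "rss_bytes"])
   df["time"] = pd.to_datetime(df["time"], unit="s")

   fig, (ax_cpu, ax_ram) = plt.subplots(2, 1, sharex=True)
   for service, group in df.groupby("service"):
       ax_cpu.plot(group["time"], group["cpu_percent"], label=service)
       ax_ram.plot(group["time"], group["rss_bytes"] / 2.0 ** 30, label=service)

   ax_cpu.set_ylabel("CPU, %")
   ax_ram.set_ylabel("RSS, GiB")
   ax_cpu.legend(loc="upper left", fontsize="small")
   fig.savefig("usage.png")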

Preparation
^^^^^^^^^^^

**Common preparation steps**

To begin testing, the environment should have all the OpenStack services up
and running. They should, of course, be configured according to the settings
recommended for the release and/or for your specific environment or use case.
To get real-world RPS/TPS/etc. metrics, all the services (including the
compute nodes) should run on separate physical servers, but again this depends
on the setup and the requirements. For simplicity, and for testing only the
control plane, the fake compute driver can be used.

Environment description
^^^^^^^^^^^^^^^^^^^^^^^

The environment description includes the hardware specification of the
servers, network parameters, operating system and OpenStack deployment
characteristics.

Hardware
~~~~~~~~

This section contains the list of all types of hardware nodes.

+-----------+-------+----------------------------------------------------+
| Parameter | Value | Comments                                           |
+-----------+-------+----------------------------------------------------+
| model     |       | e.g. Supermicro X9SRD-F                            |
+-----------+-------+----------------------------------------------------+
| CPU       |       | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
+-----------+-------+----------------------------------------------------+

Network
~~~~~~~

This section contains the list of interfaces and network parameters.
For complicated cases this section may include a topology diagram and switch
parameters.

+------------------+-------+-------------------------+
| Parameter        | Value | Comments                |
+------------------+-------+-------------------------+
| card model       |       | e.g. Intel              |
+------------------+-------+-------------------------+
| driver           |       | e.g. ixgbe              |
+------------------+-------+-------------------------+
| speed            |       | e.g. 10G or 1G          |
+------------------+-------+-------------------------+

Software
~~~~~~~~

This section describes the installed software.

+-------------------+--------+---------------------------+
| Parameter         | Value  | Comments                  |
+-------------------+--------+---------------------------+
| OS                |        | e.g. Ubuntu 14.04.3       |
+-------------------+--------+---------------------------+
| DB                |        | e.g. MySQL 5.6            |
+-------------------+--------+---------------------------+
| MQ broker         |        | e.g. RabbitMQ v3.4.25     |
+-------------------+--------+---------------------------+
| OpenStack release |        | e.g. Liberty              |
+-------------------+--------+---------------------------+


Configuration
~~~~~~~~~~~~~

This section describes the configuration of OpenStack and core services.

+-------------------+-------------------------------+
| Parameter         | File                          |
+-------------------+-------------------------------+
| Keystone          | ./results/keystone.conf       |
+-------------------+-------------------------------+
| Nova-api          | ./results/nova-api.conf       |
+-------------------+-------------------------------+
| ...               |                               |
+-------------------+-------------------------------+


Test Case 1: Resources consumption under severe load
-----------------------------------------------------

Description
^^^^^^^^^^^

This test spawns a number of instances in n parallel threads and, along with
that, records the CPU/RAM metrics of all the OpenStack and core services, such
as the MQ broker and the DB server. As the test itself is pretty long, there
is no need for a very high sampling resolution; one measurement per 5 seconds
should be more than enough.

A Rally scenario that creates a load of 50 parallel threads spawning VMs and
listing them can be found in the test plan folder and can be used for testing
purposes. It can be modified to fit specific deployment needs; a brief usage
sketch follows.
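
Assuming an already-configured Rally deployment and this folder's
``rallytest.json`` in the current directory (verify the exact flags against
your Rally version), the scenario can be started and turned into an HTML
report roughly like this:

.. code-block:: python

   import subprocess

   # Assumes a working Rally deployment ("rally deployment create ...") and
   # the rallytest.json from this test plan folder in the current directory.
   subprocess.check_call(["rally", "task", "start", "rallytest.json"])

   # Export the results of the last task to an HTML report for later analysis.
   subprocess.check_call(["rally", "task", "report", "--out", "report.html"])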

Parameters
^^^^^^^^^^

============================ ====================================================
Parameter name               Value
============================ ====================================================
OpenStack release            Liberty, Mitaka
Compute nodes amount         50, 100, 200, 500, 1000, 2000, 5000, 10000
Services configurations      Configuration for each OpenStack and core service
============================ ====================================================

List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The test case result is presented as a weighted tree structure with operations
as nodes and the time spent on them as node weights, for every control plane
operation under test. This information is automatically gathered in Ceilometer
and can be transformed into a human-friendly report with OSprofiler.

======== =============== ================= =================================
Priority Value           Measurement Units Description
======== =============== ================= =================================
1        CPU load        MHz               CPU load for each OpenStack
                                           service
2        RAM consumption GB                RAM consumption for each
                                           OpenStack service
3        Instances amnt  Amount            Max number of instances spawned
4        Operation time  milliseconds      Time spent for every instance
                                           spawn
======== =============== ================= =================================
58
doc/source/test_plans/1000_nodes/rallytest.json
Normal file
@@ -0,0 +1,58 @@
{
    "NovaServers.boot_and_list_server": [
        {
            "runner": {
                "type": "constant",
                "concurrency": 50,
                "times": 20000
            },
            "args": {
                "detailed": true,
                "flavor": {
                    "name": "m1.tiny"
                },
                "image": {
                    "name": "cirros"
                }
            },
            "sla": {
                "failure_rate": {
                    "max": 0
                }
            },
            "context": {
                "users": {
                    "project_domain": "default",
                    "users_per_tenant": 2,
                    "tenants": 200,
                    "resource_management_workers": 30,
                    "user_domain": "default"
                },
                "quotas": {
                    "nova": {
                        "ram": -1,
                        "floating_ips": -1,
                        "security_group_rules": -1,
                        "instances": -1,
                        "cores": -1,
                        "security_groups": -1
                    },
                    "neutron": {
                        "subnet": -1,
                        "network": -1,
                        "port": -1
                    }
                },
                "network": {
                    "network_create_args": {
                        "tenant_id": "d51f243eba4d48d09a853e23aeb68774",
                        "name": "c_rally_b7d5d2f5_OqPRUMD8"
                    },
                    "subnets_per_network": 1,
                    "start_cidr": "1.0.0.0/21",
                    "networks_per_tenant": 1
                }
            }
        }
    ]
}
@@ -19,4 +19,5 @@ Test Plans
container_cluster_systems/plan
neutron_features/l3_ha/test_plan
hardware_features/index
1000_nodes/plan
141
doc/source/test_results/1000_nodes/index.rst
Normal file
@@ -0,0 +1,141 @@
Testing at a scale of 1000 compute hosts
=========================================

Environment setup
-----------------

Each and every OpenStack service was placed into a container. The containers
were spread mostly across 17 nodes. "Mostly" means that some of them were
placed on separate nodes so that they could get more resources if needed,
without being limited by other services. After some initial assumptions these
privileges were given to the following containers: rabbitmq, mysql, keystone,
nova-api and neutron. Later, after some observations, only rabbitmq, mysql and
keystone kept these privileges. All other containers were placed with somewhat
higher priorities, but without dedicating additional hosts to them.

List of OpenStack and core services which were used in the testing environment
(the number in parentheses is the number of instances/containers):

- nova-api (1)
- nova-scheduler (8)
- nova-conductor (1)
- nova-compute (1000)
- glance-api (1)
- glance-registry (1)
- neutron-server (1)
- neutron-dhcp-agent (1)
- neutron-ovs-agent (1000)
- keystone-all (1)
- rabbitmq (1)
- mysqld (1)
- memcached (1)
- horizon (1)
- libvirtd (1000)
- openvswitch-vswitch (1000)


Additional information
----------------------

We ran 8 instances of nova-scheduler because it is known not to scale within
the service itself (there are no workers/threads/etc. inside nova-scheduler).
All other OpenStack services were run as a single instance.
Each and every "Compute node" container has neutron-ovs-agent, libvirtd,
nova-compute and openvswitch-vswitch inside, so we have 1000 "Compute"
containers spread across ~13 nodes.

The prime aim of this simple test is to check the scalability of the OpenStack
control and data plane services. Because of that, RabbitMQ and MySQL were run
in single-node mode, just to verify the "essential load" and confirm that
there are no issues with standalone nodes. Later we will run tests with a
Galera cluster and clustered RabbitMQ.

We used the official Mirantis MOS 8.0 (OpenStack Liberty release) repository
for building the containers with OpenStack services.

A set of tests was run with the fake compute driver for preliminary checks and
overall load and placement verification. Later we modified the original
libvirt driver so that it only skips the actual VM boot (the spawning of the
qemu-kvm process). Everything else related to instance spawning is actually
performed.

Glance was used with local file storage as a backend. CirrOS images (~13 MB)
were used for booting the VMs. The local disks of the nodes/containers were
used as storage for the VMs.

Methodology
-----------

For simplicity we chose the "boot and list VM" scenario in Rally with the
following important parameters (an arithmetic check follows this list):

- Total number of instances: 20000
- Total number of workers: 50
- Total number of networks: 200
- Total number of tenants: 200
- Total number of users: 400
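
These totals follow directly from the Rally scenario shipped with this report;
a quick, purely illustrative check of the arithmetic (the file path is an
assumption, relative to the test plan folder):

.. code-block:: python

   import json

   # Derive the methodology totals from the Rally scenario used in this test.
   with open("rallytest.json") as f:  # assumed path, relative to the test plan folder
       scenario = json.load(f)["NovaServers.boot_and_list_server"][0]

   runner = scenario["runner"]
   users = scenario["context"]["users"]
   network = scenario["context"]["network"]

   instances = runner["times"]                          # 20000
   workers = runner["concurrency"]                      # 50
   tenants = users["tenants"]                           # 200
   total_users = tenants * users["users_per_tenant"]    # 200 * 2 = 400
   networks = tenants * network["networks_per_tenant"]  # 200 * 1 = 200

   print("instances=%d workers=%d networks=%d tenants=%d users=%d"
         % (instances, workers, networks, tenants, total_users))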

Even over 2-3 years of operating a cloud, the probability of 1000 compute
hosts being added at the same moment (all of them within 10-15 seconds) is
close to 0%. It is therefore necessary to start all the Compute containers and
wait for ~5-10 minutes, to give Neutron DVR enough time to update all the
agents so that they know about each other.

After that we start the Rally test scenario. Because of the nature of the
changes in the nova-compute driver, a VM boot is considered successful before
the security groups are applied to it (similar to running with
vif_plugging_is_fatal=False). This leads to an increased load on the Neutron
server and to the possibility that not all the rules are applied by the end of
the testing. In our case, however, it creates a bigger load on Neutron, which
makes this test much heavier. We plan to repeat this test later without this
particular behavior and compare the results.

In the folder with this report you will find additional files with the test
scenario, the results and observations of the usage patterns.

Here we would just like to point out some findings about the resource
consumption of each and every service, which could help with server capacity
planning. All servers had 2 x Intel Xeon E5-2680 v3 CPUs.
Table 1 below shows the top watermarks of the different services under the
test load described above.


Table 1. Services top watermarks

+-----------------+---------+----------+
| Service         | CPU     | RAM      |
+=================+=========+==========+
| nova-api        | 13 GHz  | 12.4 GB  |
+-----------------+---------+----------+
| nova-scheduler* | 1 GHz   | 1.1 GB   |
+-----------------+---------+----------+
| nova-conductor  | 30 GHz  | 4.8 GB   |
+-----------------+---------+----------+
| glance-api      | 160 MHz | 1.8 GB   |
+-----------------+---------+----------+
| glance-registry | 300 MHz | 1.8 GB   |
+-----------------+---------+----------+
| neutron-server  | 30 GHz  | 20 GB    |
+-----------------+---------+----------+
| keystone-all    | 14 GHz  | 2.7 GB   |
+-----------------+---------+----------+
| rabbitmq        | 21 GHz  | 17 GB    |
+-----------------+---------+----------+
| mysqld          | 1.9 GHz | 3.5 GB   |
+-----------------+---------+----------+
| memcached       | 10 MHz  | 27 MB    |
+-----------------+---------+----------+

| * each of the eight nova-scheduler processes.

A very first assumption at the scale of 1000 nodes is that it would be good to
have 2 dedicated servers per component for the following components:
nova-conductor, nova-api, neutron-server and keystone. The RabbitMQ and MySQL
servers worked in standalone mode, so clustering overhead will be added and
they will consume much more resources than we have already metered.
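
One rough way to read Table 1 for capacity planning is to compare the
watermarks against a per-server CPU budget. The figures below (2 sockets of 12
cores at a 2.5 GHz base clock for the E5-2680 v3 pair, and a 50% target
utilisation) are assumptions for illustration, not measurements from this
test:

.. code-block:: python

   # Peak CPU watermarks from Table 1, in GHz (nova-scheduler: 8 workers x 1 GHz each).
   watermarks_ghz = {
       "nova-api": 13, "nova-scheduler": 8 * 1, "nova-conductor": 30,
       "neutron-server": 30, "keystone-all": 14, "rabbitmq": 21, "mysqld": 1.9,
   }

   # Assumed per-server CPU budget: 2 sockets x 12 cores x 2.5 GHz base clock,
   # with a 50% target utilisation to leave headroom for spikes and clustering.
   server_budget_ghz = 2 * 12 * 2.5
   target_utilisation = 0.5

   for service, peak_ghz in sorted(watermarks_ghz.items()):
       share = peak_ghz / (server_budget_ghz * target_utilisation)
       print("%-15s peaked at %4.1f GHz -> %.2f of a dedicated server"
             % (service, peak_ghz, share))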

Graphs:

.. image:: stats1.png
   :width: 1300px
.. image:: stats2.png
   :width: 1300px
BIN
doc/source/test_results/1000_nodes/stats1.png
Normal file
Binary file not shown.
BIN
doc/source/test_results/1000_nodes/stats2.png
Normal file
Binary file not shown.
@@ -19,3 +19,4 @@ Test Results
neutron_features/index
hardware_features/index
provisioning/index
1000_nodes/index
BIN
raw_results/1000_nodes/report.html.gz
Normal file
Binary file not shown.