Finding a Keystone bug while testing 20 node HA cloud performance at creating 400 VMs
(Contributed by Alexander Maretskiy, Mirantis)
Below we describe how we found a bug in Keystone and achieved a 2x increase in average performance when booting Nova servers after fixing that bug. Our initial goal was to measure the performance of booting a large number of servers on a cluster (running on a custom build of Mirantis OpenStack v5.1) and to ensure that this operation has reasonable performance and completes with no errors.
Goal
- Get data on how a cluster behaves when a large number of servers is started
- Get data on how well the Neutron component performs in this case
Summary
- Creating 400 servers with configured networking
- Servers are created concurrently, 5 servers at a time
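The load pattern in this summary (a fixed total number of iterations with a fixed number in flight at once) can be sketched in plain Python. This is only an illustration of the idea, not Rally's runner implementation: `boot_server` is a hypothetical stand-in that sleeps instead of calling Nova, and the thread pool is an assumption chosen for brevity.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

TIMES = 400        # total number of servers to create (from the summary)
CONCURRENCY = 5    # servers created at the same time (from the summary)

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0  # highest number of simultaneous iterations observed


def boot_server(iteration):
    """Hypothetical stand-in for one boot request; it only sleeps briefly."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    time.sleep(0.001)  # placeholder for the real work
    with _lock:
        _in_flight -= 1
    return iteration


def run_constant_load(times=TIMES, concurrency=CONCURRENCY):
    """Run `times` iterations with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(boot_server, range(times)))


results = run_constant_load()
print(len(results), "iterations, peak concurrency", peak_in_flight)
```

The pool size caps the number of simultaneous iterations, so a new iteration starts only when an earlier one finishes.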
Hardware
The tests ran on a real hardware lab with 20 nodes:
Vendor | SUPERMICRO SUPERSERVER |
CPU |
|
RAM | 32GB (4 x Samsung DDRIII 8GB) |
HDD | 1TB |
Cluster
This cluster was created via the Fuel Dashboard interface.
Deployment | Custom build of Mirantis OpenStack v5.1 |
OpenStack release | Icehouse |
Operating System | Ubuntu 12.04.4 |
Mode | High availability |
Hypervisor | KVM |
Networking | Neutron with GRE segmentation |
Controller nodes | 3 |
Compute nodes | 17 |
Rally
Version
For this test case, we used a custom build of Rally with the following patch:
https://review.openstack.org/#/c/96300/
Deployment
Rally was deployed for the cluster using the ExistingCloud deployment type.
Server flavor
$ nova flavor-show ram64
+----------------------------+--------------------------------------+
| Property | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 0 |
| extra_specs | {} |
| id | 2e46aba0-9e7f-4572-8b0a-b12cfe7e06a1 |
| name | ram64 |
| os-flavor-access:is_public | True |
| ram | 64 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 1 |
+----------------------------+--------------------------------------+
Server image
$ glance image-show d1c116f4-3c38-4aa6-8fa1-f7a28c4e72a6
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| checksum | 053ad369d58aa98afb1d355aa16b0663 |
| container_format | bare |
| created_at | 2018-01-09T06:23:18Z |
| disk_format | qcow2 |
| id | d1c116f4-3c38-4aa6-8fa1-f7a28c4e72a6 |
| min_disk | 0 |
| min_ram | 0 |
| name | TestVM |
| owner | 01cb845eee6449cea4381865a1270736 |
| protected | False |
| size | 5254208 |
| status | active |
| tags | [] |
| updated_at | 2018-01-09T06:23:18Z |
| virtual_size | None |
| visibility | public |
+------------------+--------------------------------------+
Task configuration file (in JSON format):
{
    "NovaServers.boot_server": [
        {
            "args": {
                "flavor": {
                    "name": "ram64"
                },
                "image": {
                    "name": "TestVM"
                }
            },
            "runner": {
                "type": "constant",
                "concurrency": 5,
                "times": 400
            },
            "context": {
                "neutron_network": {
                    "network_ip_version": 4
                },
                "users": {
                    "concurrent": 30,
                    "users_per_tenant": 5,
                    "tenants": 5
                },
                "quotas": {
                    "neutron": {
                        "subnet": -1,
                        "port": -1,
                        "network": -1,
                        "router": -1
                    }
                }
            }
        }
    ]
}
The only difference between the first and second runs is that runner.times was set to 500 for the first run.
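As a small illustration of that difference, the runner section for the first run can be derived from the one shown in the task file above by changing a single key. The helper name `with_times` is hypothetical, and only the runner section is reproduced here to keep the sketch short.

```python
import copy
import json

# Runner section copied from the task file above (second run).
second_run_runner = {"type": "constant", "concurrency": 5, "times": 400}


def with_times(runner, times):
    """Return a copy of the runner section with a different iteration count."""
    new_runner = copy.deepcopy(runner)
    new_runner["times"] = times
    return new_runner


# The first run used the same settings, except for 500 iterations.
first_run_runner = with_times(second_run_runner, 500)
print(json.dumps(first_run_runner, indent=4, sort_keys=True))
```

Copying the section before editing it leaves the second-run settings untouched, so both variants can be written out side by side.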
Results
First run, a bug was found:
Starting from the 142nd server, novaclient returned an error: Error <class 'novaclient.exceptions.Unauthorized'>: Unauthorized (HTTP 401).
That is how the bug in Keystone was found.
action | min (sec) | avg (sec) | max (sec) | 90 percentile | 95 percentile | success | count |
nova.boot_server total | 6.507 | 17.402 | 100.303 | 39.222 | 50.134 | 26.8% | 500 |
Second run, with bugfix:
After a patch was applied (using RPC instead of the Neutron client in the metadata agent), we got a 100% success rate and roughly 2x better average performance:
action | min (sec) | avg (sec) | max (sec) | 90 percentile | 95 percentile | success | count |
nova.boot_server total | 5.031 | 8.008 | 14.093 | 9.616 | 9.716 | 100.0% | 400 |
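The 2x figure can be checked against the two result rows with a few lines of arithmetic. The sketch below simply divides each first-run value by the matching second-run value; the dictionary keys and the `speedup` helper are illustrative names, and the comparison is indicative only, since the first run used 500 iterations and had a 26.8% success rate.

```python
# Values in seconds, copied from the two nova.boot_server result rows.
first_run = {"min": 6.507, "avg": 17.402, "max": 100.303,
             "p90": 39.222, "p95": 50.134}
second_run = {"min": 5.031, "avg": 8.008, "max": 14.093,
              "p90": 9.616, "p95": 9.716}


def speedup(before, after):
    """Ratio of each 'before' duration to the matching 'after' duration."""
    return {key: before[key] / after[key] for key in before}


ratios = speedup(first_run, second_run)
for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

The average boot time ratio comes out at about 2.17, which matches the 2x claim, while the maximum and the upper percentiles improve by a larger factor.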