New version of reliability test plan

The new version defines MTTR, service downtime and
operation degradation metrics. It is targeted to be executed
with Rally and os-faults library.

Change-Id: I31f74a41b1b2e725986e4593fde92768f0237aa4
This commit is contained in:
Ilya Shakhat 2016-09-15 12:34:36 +03:00 committed by Ilya Shakhat
parent 20d28676ec
commit e1ddea22ea
7 changed files with 327 additions and 0 deletions

View File

@ -0,0 +1,14 @@
.. raw:: pdf
PageBreak oneColumn
=============================
OpenStack reliability testing
=============================
.. toctree::
:maxdepth: 3
:glob:
*/plan
*/index

View File

@ -0,0 +1,49 @@
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
import os_faults
from rally.common import logging
from rally import consts
from rally.task import hook
LOG = logging.getLogger(__name__)
@hook.configure(name="fault_injection")
class FaultInjectionHook(hook.Hook):
"""Performs fault injection."""
CONFIG_SCHEMA = {
"type": "object",
"$schema": consts.JSON_SCHEMA,
"properties": {
"action": {"type": "string"},
},
"required": [
"action",
],
"additionalProperties": False,
}
def run(self):
LOG.debug("Injecting fault: %s", self.config["action"])
injector = os_faults.connect()
try:
os_faults.human_api(injector, self.config["action"])
self.set_status(consts.HookStatus.SUCCESS)
except Exception as e:
self.set_status(consts.HookStatus.FAILED)
self.set_error(exception_name=type(e),
description='Fault injection failure',
details=str(e))

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

View File

@ -0,0 +1,264 @@
.. _reliability_testing:
==========================================
OpenStack reliability testing. Version 2.0
==========================================
:status: **draft**
:version: 2.0
:Abstract:
This test plan describes methodology for reliability testing of OpenStack.
:Conventions:
- **Recovery period** - the period of time after the fault when service
performance degrades
- **MTTR** - mean time to recover service performance after the fault.
- **Service Downtime** - the time when service was not available and number
of errors is more than defined by SLA.
- **Operation Degradation** - the difference in operation performance
compared with performance when service operates normally.
- **Fault injection** - the function that emulates failure in software or
hardware.
- **Service hang** - fault that emulates hanging service by
sending `SIGSTOP` and `SIGCONT` POSIX signals to service process(es).
- **Service crash** - fault that emulates abnormal program termination
by sending `SIGKILL` signal to service process(es).
- **Node crash** - fault that emulates unexpected power outage of hardware.
- **Network partition** - fault that result in connectivity loss between
service components running on different hardware nodes; used to toggle
split-brain conditions in HA service.
- **Network flapping** - fault that emulates disconnection of network
interface on hardware node or switch.
Test Plan
=========
Test Environment
----------------
Preparation
^^^^^^^^^^^
This test plan is executed against existing OpenStack cloud.
Measurements can be done with the tool that:
* is able to inject faults into existing OpenStack cloud at specified moment
of execution;
* collects duration of single operations and errors;
* calculates metrics specified in the test plan (e.g. MTTR, Service Downtime).
Environment description
^^^^^^^^^^^^^^^^^^^^^^^
The environment description includes hardware specification of servers,
network parameters, operation system and OpenStack deployment characteristics.
Hardware
~~~~~~~~
This section contains list of all types of hardware nodes.
+-----------+-------+----------------------------------------------------+
| Parameter | Value | Comments |
+-----------+-------+----------------------------------------------------+
| model | | e.g. Supermicro X9SRD-F |
+-----------+-------+----------------------------------------------------+
| CPU | | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
+-----------+-------+----------------------------------------------------+
| role | | e.g. compute or network |
+-----------+-------+----------------------------------------------------+
Network
~~~~~~~
This section contains list of interfaces and network parameters.
For complicated cases this section may include topology diagram and switch
parameters.
+------------------+-------+-------------------------+
| Parameter | Value | Comments |
+------------------+-------+-------------------------+
| network role | | e.g. provider or public |
+------------------+-------+-------------------------+
| card model | | e.g. Intel |
+------------------+-------+-------------------------+
| driver | | e.g. ixgbe |
+------------------+-------+-------------------------+
| speed | | e.g. 10G or 1G |
+------------------+-------+-------------------------+
| MTU | | e.g. 9000 |
+------------------+-------+-------------------------+
| offloading modes | | e.g. default |
+------------------+-------+-------------------------+
Software
~~~~~~~~
This section describes installed software.
+-----------------+-------+---------------------------+
| Parameter | Value | Comments |
+-----------------+-------+---------------------------+
| OS | | e.g. Ubuntu 14.04.3 |
+-----------------+-------+---------------------------+
| OpenStack | | e.g. Liberty |
+-----------------+-------+---------------------------+
| Hypervisor | | e.g. KVM |
+-----------------+-------+---------------------------+
| Neutron plugin | | e.g. ML2 + OVS |
+-----------------+-------+---------------------------+
| L2 segmentation | | e.g. VLAN or VxLAN or GRE |
+-----------------+-------+---------------------------+
| virtual routers | | e.g. legacy or HA or DVR |
+-----------------+-------+---------------------------+
Test Case: Reliability Metrics Calculation
------------------------------------------
Description
^^^^^^^^^^^
The test case is performed by running a specific OpenStack operation with
injected fault. Every test is executed several times to collect more reliable
statistical data.
Parameters
^^^^^^^^^^
The test case is configured with:
* OpenStack operation that is tested (e.g. *network creation*);
* fault that is injected into execution pipeline (e.g. *service restart*);
Types of faults:
* Service-related:
* restart - service is stopped gracefully and then started;
* kill - service is terminated abruptly by OS;
* unplug/plug - service network partitioning.
* Node-related:
* reboot - node is rebooted gracefully;
* reset - cold restart of the node with potential data loss;
* poweroff/poweron - node is switched off and on;
* connect/disconnect - node's network interface is flapped.
List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^
A particular fault may affect operations in different ways. Operations
may fail with error and we can count such errors and estimate how long the
downtime was. Operations may degrade in performance and we can compare
performance with base numbers. Also we can estimate time while the
performance was degraded.
If both errors and performance degradation are observed the image could
look like the following:
.. image:: hypothesis.png
Here the light blue line shows the mean operation duration, orange area is
where errors are observed and yellow where the performance is low.
Overall the following metrics need to be collected:
.. list-table::
:header-rows: 1
*
- Priority
- Value
- Measurement Unit
- Description
*
- 1
- Service downtime
- sec
- How long the service was not available and operations were in error
state.
*
- 1
- MTTR
- sec
- How long does it takes to recover service performance after the failure.
*
- 1
- Operation Degradation
- sec
- the mean of difference in operation performance during recovery period
and operation performance when service operates normally.
*
- 1
- Operation Degradation Ratio
- sec
- the ratio between operation performance during recovery period and
operation performance when service operates normally.
The final report may also contain one or more charts that show operation
behavior during the test.
Tools
=====
Rally + os-faults
-----------------
This test plan can be executed with `Rally`_ tool. Rally can report
duration of individual operations and report errors. Rally `Hooks` features
allows to call external code at specified moments of scenario execution.
`os-faults`_ library provides a generic way to inject faults into OpenStack
cloud. It supports both service and node based operations.
The integration between Rally and os-faults is implemented as Rally hooks
plugin: :download:`fault_injection.py <code/rally_plugins/fault_injection.py>`
Calculations
^^^^^^^^^^^^
Metrics calculations are based on raw data collected from Rally (Rally json
output). The raw data contains list of iterations with duration of each
iteration. If some operation failed the iteration contains error field. Also
raw data contains hook information, when it was started and its execution
status.
The period of scenario execution before the hook is interpreted as the
baseline. It is used to measure operation's baseline mean and deviation.
`Service downtime` is calculated as time interval between the first and the
last errors. The precision of calculation is average distance between the
last succeed operation and the first error, and the last error and the next
succeed operation.
To find the recovery period we first calculate the mean duration of
consequent operations with sliding window. The period is treated as
`Recovery period` when mean operation duration is significantly more than
the mean operation duration in the baseline. `Operation degradation` is
calculated as difference between mean of operation duration during Recovery
period and the baseline's. `Operation ratio` is the ratio between mean of
operation duration during Recovery period and the baseline's.
.. references:
.. _Rally: https://rally.readthedocs.io/
.. _os-faults: https://os-faults.readthedocs.io/