New version of reliability test plan
The new version defines MTTR, service downtime and operation degradation metrics. It is intended to be executed with Rally and the os-faults library.

Change-Id: I31f74a41b1b2e725986e4593fde92768f0237aa4
parent 20d28676ec
commit e1ddea22ea
14
doc/source/test_plans/reliability/index.rst
Normal file
@ -0,0 +1,14 @@
.. raw:: pdf

    PageBreak oneColumn

=============================
OpenStack reliability testing
=============================

.. toctree::
    :maxdepth: 3
    :glob:

    */plan
    */index
@ -0,0 +1,49 @@
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

import os_faults

from rally.common import logging
from rally import consts
from rally.task import hook

LOG = logging.getLogger(__name__)


@hook.configure(name="fault_injection")
class FaultInjectionHook(hook.Hook):
    """Performs fault injection with the os-faults library."""

    CONFIG_SCHEMA = {
        "type": "object",
        "$schema": consts.JSON_SCHEMA,
        "properties": {
            "action": {"type": "string"},
        },
        "required": [
            "action",
        ],
        "additionalProperties": False,
    }

    def run(self):
        LOG.debug("Injecting fault: %s", self.config["action"])
        injector = os_faults.connect()

        try:
            os_faults.human_api(injector, self.config["action"])
            self.set_status(consts.HookStatus.SUCCESS)
        except Exception as e:
            self.set_status(consts.HookStatus.FAILED)
            self.set_error(exception_name=type(e).__name__,
                           description='Fault injection failure',
                           details=str(e))
BIN
doc/source/test_plans/reliability/version_2/hypothesis.png
Normal file
Binary file not shown. Size: 40 KiB
264
doc/source/test_plans/reliability/version_2/plan.rst
Normal file
@ -0,0 +1,264 @@
.. _reliability_testing:

==========================================
OpenStack reliability testing. Version 2.0
==========================================

:status: **draft**
:version: 2.0

:Abstract:

  This test plan describes a methodology for reliability testing of OpenStack.

:Conventions:

  - **Recovery period** - the period of time after the fault during which
    service performance is degraded.

  - **MTTR** - mean time to recover service performance after the fault.

  - **Service Downtime** - the time during which the service is not available
    and the number of errors exceeds the level defined by the SLA.

  - **Operation Degradation** - the difference in operation performance
    compared with performance when the service operates normally.

  - **Fault injection** - the function that emulates a failure in software or
    hardware.

  - **Service hang** - a fault that emulates a hanging service by
    sending `SIGSTOP` and `SIGCONT` POSIX signals to the service process(es).

  - **Service crash** - a fault that emulates abnormal program termination
    by sending the `SIGKILL` signal to the service process(es).

  - **Node crash** - a fault that emulates an unexpected power outage of
    hardware.

  - **Network partition** - a fault that results in connectivity loss between
    service components running on different hardware nodes; used to trigger
    split-brain conditions in an HA service.

  - **Network flapping** - a fault that emulates disconnection of a network
    interface on a hardware node or switch.


Test Plan
=========

Test Environment
----------------

Preparation
^^^^^^^^^^^

This test plan is executed against an existing OpenStack cloud.

Measurements can be done with a tool that:

* is able to inject faults into an existing OpenStack cloud at a specified
  moment of execution;
* collects the duration of single operations and errors;
* calculates the metrics specified in the test plan (e.g. MTTR, Service
  Downtime).


Environment description
^^^^^^^^^^^^^^^^^^^^^^^

The environment description includes hardware specification of servers,
network parameters, operating system and OpenStack deployment characteristics.

Hardware
~~~~~~~~

This section contains a list of all types of hardware nodes.

+-----------+-------+----------------------------------------------------+
| Parameter | Value | Comments                                           |
+-----------+-------+----------------------------------------------------+
| model     |       | e.g. Supermicro X9SRD-F                            |
+-----------+-------+----------------------------------------------------+
| CPU       |       | e.g. 6 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz |
+-----------+-------+----------------------------------------------------+
| role      |       | e.g. compute or network                            |
+-----------+-------+----------------------------------------------------+

Network
~~~~~~~

This section contains a list of interfaces and network parameters.
For complicated cases this section may include a topology diagram and switch
parameters.

+------------------+-------+-------------------------+
| Parameter        | Value | Comments                |
+------------------+-------+-------------------------+
| network role     |       | e.g. provider or public |
+------------------+-------+-------------------------+
| card model       |       | e.g. Intel              |
+------------------+-------+-------------------------+
| driver           |       | e.g. ixgbe              |
+------------------+-------+-------------------------+
| speed            |       | e.g. 10G or 1G          |
+------------------+-------+-------------------------+
| MTU              |       | e.g. 9000               |
+------------------+-------+-------------------------+
| offloading modes |       | e.g. default            |
+------------------+-------+-------------------------+

Software
~~~~~~~~

This section describes installed software.

+-----------------+-------+---------------------------+
| Parameter       | Value | Comments                  |
+-----------------+-------+---------------------------+
| OS              |       | e.g. Ubuntu 14.04.3       |
+-----------------+-------+---------------------------+
| OpenStack       |       | e.g. Liberty              |
+-----------------+-------+---------------------------+
| Hypervisor      |       | e.g. KVM                  |
+-----------------+-------+---------------------------+
| Neutron plugin  |       | e.g. ML2 + OVS            |
+-----------------+-------+---------------------------+
| L2 segmentation |       | e.g. VLAN or VxLAN or GRE |
+-----------------+-------+---------------------------+
| virtual routers |       | e.g. legacy or HA or DVR  |
+-----------------+-------+---------------------------+


Test Case: Reliability Metrics Calculation
------------------------------------------

Description
^^^^^^^^^^^

The test case is performed by running a specific OpenStack operation with an
injected fault. Every test is executed several times to collect more reliable
statistical data.


Parameters
^^^^^^^^^^

The test case is configured with:

* the OpenStack operation that is tested (e.g. *network creation*);
* the fault that is injected into the execution pipeline (e.g. *service
  restart*).

Types of faults:

* Service-related:

  * restart - service is stopped gracefully and then started;
  * kill - service is terminated abruptly by the OS;
  * unplug/plug - service network partitioning.

* Node-related:

  * reboot - node is rebooted gracefully;
  * reset - cold restart of the node with potential data loss;
  * poweroff/poweron - node is switched off and on;
  * disconnect/connect - node's network interface is flapped.


List of performance metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A particular fault may affect operations in different ways. Operations
may fail with an error; we can count such errors and estimate how long the
downtime was. Operations may also degrade in performance; we can compare
their performance with the baseline numbers and estimate how long the
performance stayed degraded.

If both errors and performance degradation are observed, the picture could
look like the following:

.. image:: hypothesis.png

Here the light blue line shows the mean operation duration, the orange area is
where errors are observed and the yellow area is where the performance is low.

Overall, the following metrics need to be collected:

.. list-table::
   :header-rows: 1

   *
     - Priority
     - Value
     - Measurement Unit
     - Description
   *
     - 1
     - Service downtime
     - sec
     - How long the service was not available and operations were in error
       state.
   *
     - 1
     - MTTR
     - sec
     - How long it takes to recover service performance after the failure.
   *
     - 1
     - Operation Degradation
     - sec
     - The mean difference between operation performance during the recovery
       period and operation performance when the service operates normally.
   *
     - 1
     - Operation Degradation Ratio
     - dimensionless
     - The ratio between operation performance during the recovery period and
       operation performance when the service operates normally.

The final report may also contain one or more charts that show operation
behavior during the test.


Tools
=====

Rally + os-faults
-----------------

This test plan can be executed with the `Rally`_ tool. Rally can report the
duration of individual operations and report errors. The Rally `Hooks` feature
allows external code to be called at specified moments of scenario execution.

The `os-faults`_ library provides a generic way to inject faults into an
OpenStack cloud. It supports both service-based and node-based operations.

The integration between Rally and os-faults is implemented as a Rally hook
plugin: :download:`fault_injection.py <code/rally_plugins/fault_injection.py>`

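As an illustration of how the hook could be wired into a Rally task, the
sketch below builds a task description that fires `fault_injection` once and
prints it as JSON. The scenario name, runner settings, trigger layout and
iteration number are assumptions chosen for the example; the exact task schema
depends on the Rally version in use and should be checked against the
installed release. Only the `action` argument is taken from the plugin's
configuration schema.

```python
import json


def build_task(action, at_iteration):
    """Return a Rally task dict that fires the fault_injection hook once.

    The layout (scenario name, runner, context, trigger) is an assumption
    made for illustration; adjust it to the task format of your Rally
    version.
    """
    return {
        "Authenticate.keystone": [
            {
                "runner": {
                    "type": "constant_for_duration",
                    "duration": 60,
                    "concurrency": 5,
                },
                "context": {
                    "users": {"tenants": 1, "users_per_tenant": 1},
                },
                "hooks": [
                    {
                        # Name registered by @hook.configure in the plugin.
                        "name": "fault_injection",
                        # The plugin schema requires exactly one key: action.
                        "args": {"action": action},
                        # Assumed trigger: fire at the given iteration.
                        "trigger": {
                            "name": "event",
                            "args": {"unit": "iteration",
                                     "at": [at_iteration]},
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    task = build_task("restart keystone service", 100)
    print(json.dumps(task, indent=2))
```

The iterations executed before the trigger point form the baseline, so the
trigger should be placed late enough to collect a meaningful number of
fault-free operations.
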
Calculations
^^^^^^^^^^^^

Metrics calculations are based on raw data collected from Rally (Rally JSON
output). The raw data contains a list of iterations with the duration of each
iteration. If an operation failed, the iteration contains an error field. The
raw data also contains hook information: when the hook was started and its
execution status.

The period of scenario execution before the hook is interpreted as the
baseline. It is used to measure the operation's baseline mean and deviation.

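The baseline statistics could be computed along the following lines. This is
a minimal sketch: the iteration layout (dicts with `timestamp`, `duration`
and `error` keys) is a simplified assumption and not the exact Rally output
schema, and `hook_started_at` is a hypothetical name for the hook start time
taken from the raw data.

```python
import statistics


def baseline_stats(iterations, hook_started_at):
    """Return (mean, stdev) of baseline operation durations.

    Only successful iterations that finished before the hook started are
    treated as baseline. The dict keys used here are assumptions.
    """
    durations = [
        it["duration"]
        for it in iterations
        if not it["error"]
        and it["timestamp"] + it["duration"] <= hook_started_at
    ]
    if len(durations) < 2:
        raise ValueError("at least two baseline iterations are required")
    return statistics.mean(durations), statistics.stdev(durations)
```

Iterations that were still running when the fault was injected are excluded
here so that the fault cannot leak into the baseline; whether to exclude them
is a design choice of this sketch.
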
`Service downtime` is calculated as the time interval between the first and
the last errors. The precision of the calculation is the average of two
distances: between the last successful operation and the first error, and
between the last error and the next successful operation.

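The downtime rule above can be sketched as follows, again assuming a
simplified iteration layout (dicts with `timestamp` and `error` keys) rather
than the exact Rally output schema. The function name and the choice to
return zero when no errors are present are assumptions of this example.

```python
def service_downtime(iterations):
    """Return (downtime, precision) in seconds.

    downtime is the interval between the first and the last failed
    iteration; precision is the average gap to the neighbouring successful
    iterations on both sides, when such neighbours exist.
    """
    ordered = sorted(iterations, key=lambda it: it["timestamp"])
    failed = [i for i, it in enumerate(ordered) if it["error"]]
    if not failed:
        return 0.0, 0.0

    first, last = failed[0], failed[-1]
    downtime = ordered[last]["timestamp"] - ordered[first]["timestamp"]

    gaps = []
    if first > 0:
        # Distance from the last success before the outage to the first error.
        gaps.append(ordered[first]["timestamp"]
                    - ordered[first - 1]["timestamp"])
    if last < len(ordered) - 1:
        # Distance from the last error to the next success.
        gaps.append(ordered[last + 1]["timestamp"]
                    - ordered[last]["timestamp"])
    precision = sum(gaps) / len(gaps) if gaps else 0.0
    return downtime, precision
```

A denser operation stream (higher concurrency or shorter operations) narrows
the gaps around the outage and therefore improves the precision of the
downtime estimate.
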
To find the recovery period, we first calculate the mean duration of
consecutive operations with a sliding window. A period is treated as the
`Recovery period` when the mean operation duration is significantly greater
than the mean operation duration in the baseline. `Operation degradation` is
calculated as the difference between the mean operation duration during the
Recovery period and the baseline's. `Operation degradation ratio` is the ratio
between the mean operation duration during the Recovery period and the
baseline's.

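One possible reading of this procedure is sketched below. The plan does not
define "significantly greater", so the threshold of the baseline mean plus
`sigmas` standard deviations, the default window size and the use of window
centres to delimit the period are all assumptions of this example. The inputs
are the timestamps and durations of successful operations in chronological
order.

```python
import statistics


def recovery_metrics(timestamps, durations, base_mean, base_stdev,
                     window=3, sigmas=3.0):
    """Detect the recovery period and derive the degradation metrics.

    Returns None when no window mean exceeds the threshold, otherwise a
    dict with the recovery time, operation degradation and its ratio.
    """
    threshold = base_mean + sigmas * base_stdev

    # Centre indices of all sliding windows whose mean is above threshold.
    centres = [
        start + window // 2
        for start in range(len(durations) - window + 1)
        if statistics.mean(durations[start:start + window]) > threshold
    ]
    if not centres:
        return None

    first, last = centres[0], centres[-1]
    recovery_mean = statistics.mean(durations[first:last + 1])
    return {
        # From the start of the first degraded operation to the end of
        # the last one.
        "recovery_time": (timestamps[last] + durations[last]
                          - timestamps[first]),
        "degradation": recovery_mean - base_mean,
        "degradation_ratio": recovery_mean / base_mean,
    }
```

Averaging `recovery_time` over repeated runs of the same test gives an MTTR
estimate. Note that smoothing can widen the detected period by up to about
half a window on each side, so a small window is preferable when operations
are sparse.
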
.. references:

.. _Rally: https://rally.readthedocs.io/
.. _os-faults: https://os-faults.readthedocs.io/