Publish reports for reliability testing 2.0

Change-Id: Ibe31d2674dfde70c0c2d349154c1f94c8e4fc86e
Ilya Shakhat 2016-09-28 16:13:50 +03:00
parent ad9803c183
commit 4d301f62fa
33 changed files with 75475 additions and 12 deletions


@ -1,4 +1,4 @@
-.. _reliability_testing:
+.. _reliability_testing_version_2:
 ==========================================
 OpenStack reliability testing. Version 2.0
@ -18,11 +18,13 @@ OpenStack reliability testing. Version 2.0
 - **MTTR** - mean time to recover service performance after the fault.
-- **Service Downtime** - the time when service was not available and number
-  of errors is more than defined by SLA.
-- **Operation Degradation** - the difference in operation performance
-  compared with performance when service operates normally.
+- **Service Downtime** - the time when service was not available.
+- **Absolute performance degradation** - is an absolute difference between
+  the mean of operation duration during recovery period and the baseline's.
+- **Relative performance degradation** - is the ratio between the mean
+  of operation duration during recovery period and the baseline's.
 - **Fault injection** - the function that emulates failure in software or
   hardware.
@ -201,14 +203,14 @@ Overall the following metrics need to be collected:
   - How long does it take to recover service performance after the failure.
 *
   - 1
-  - Operation Degradation
+  - Absolute performance degradation
   - sec
   - the mean of difference in operation performance during recovery period
     and operation performance when service operates normally.
 *
   - 1
-  - Operation Degradation Ratio
-  - sec
+  - Relative performance degradation
+  - ratio
   - the ratio between operation performance during recovery period and
     operation performance when service operates normally.
@ -252,13 +254,45 @@ succeed operation.
 To find the recovery period we first calculate the mean duration of
 consequent operations with sliding window. The period is treated as
 `Recovery period` when mean operation duration is significantly more than
-the mean operation duration in the baseline. `Operation degradation` is
-calculated as difference between mean of operation duration during Recovery
-period and the baseline's. `Operation ratio` is the ratio between mean of
-operation duration during Recovery period and the baseline's.
+the mean operation duration in the baseline. The average duration of Recovery
+period is `MTTR` value. `Absolute performance degradation` is calculated as
+difference between mean of operation duration during Recovery period and
+the baseline's. `Relative performance degradation` is the ratio between
+mean of operation duration during Recovery period and the baseline's.
+How to run
+^^^^^^^^^^
+Prerequisites:
+* Install `Rally` tool and configure deployment parameters
+* Verify that Rally is properly installed by running ``rally show flavors``
+* Install `os-faults` library: ``pip install os-faults``
+* Configure cloud and power management parameters, refer to `os-faults-cfg`
+* Verify parameters by running ``os-inject-fault -v``
+* Install `RallyRunners` tool: ``pip install rally-runners``
+Run scenarios:
+``rally-reliability -s SCENARIO -o OUTPUT -b BOOK``
+To show full list of scenarios:
+``rally-reliability -h``
+Reports
+=======
+Test plan execution reports:
+* :ref:`reliability_test_results_version_2`
 .. references:
 .. _Rally: https://rally.readthedocs.io/
 .. _os-faults: https://os-faults.readthedocs.io/
+.. _os-faults-cfg: http://os-faults.readthedocs.io/en/latest/readme.html#usage
+.. _RallyRunners: https://github.com/shakhat/rally-runners
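
The recovery-period calculation described in the test plan can be sketched as
follows. This is a minimal illustration, not the code used by the reporting
tool: the function names, the window size and the "mean plus ``k`` standard
deviations" threshold are assumptions chosen to give "significantly more than
the baseline" a concrete meaning, and the period is expressed in sample
indices rather than seconds.

.. code-block:: python

    from statistics import mean, stdev


    def find_recovery_period(durations, baseline, window=5, k=3.0):
        """Return (start, end) sample indices of the degraded span, or None.

        Assumption: a window counts as degraded when its mean duration exceeds
        the baseline mean by more than k baseline standard deviations.
        """
        threshold = mean(baseline) + k * stdev(baseline)
        degraded = [
            i for i in range(len(durations) - window + 1)
            if mean(durations[i:i + window]) > threshold
        ]
        if not degraded:
            return None
        # Span from the first degraded window to the end of the last one.
        return degraded[0], degraded[-1] + window


    def degradation(durations, baseline, period):
        """Absolute (seconds) and relative (ratio) degradation over period."""
        start, end = period
        recovery_mean = mean(durations[start:end])
        base_mean = mean(baseline)
        return recovery_mean - base_mean, recovery_mean / base_mean


    # Made-up operation durations in seconds, for illustration only.
    baseline = [0.10, 0.12, 0.11, 0.13, 0.12, 0.11]
    after_fault = [0.12, 0.11, 1.2, 1.5, 1.1, 0.9, 0.12, 0.11, 0.12, 0.10, 0.11]

    period = find_recovery_period(after_fault, baseline, window=2)
    absolute, relative = degradation(after_fault, baseline, period)
    print(period, round(absolute, 2), round(relative, 1))

Because the window slides, the detected span includes one normal sample on
each side of the slow operations; a smaller window narrows the span at the
cost of more noise.
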


@ -0,0 +1,42 @@
.. _reliability_test_results_version_2:
========================================
OpenStack reliability testing. Version 2
========================================
Test results
============
Environment description
^^^^^^^^^^^^^^^^^^^^^^^
This report contains results for :ref:`reliability_testing_version_2`
test plan. The data is collected in :ref:`intel_mirantis_performance_lab`.
Software
~~~~~~~~
This section describes installed software.
+-----------------+--------------------------------------------+
| Parameter | Value |
+-----------------+--------------------------------------------+
| OS | Ubuntu 14.04.3 |
+-----------------+--------------------------------------------+
| OpenStack | Fuel 9.0 (Mitaka) |
+-----------------+--------------------------------------------+
| Networking | Neutron OVS ML2 plugin with VxLAN and DVR |
+-----------------+--------------------------------------------+
Reports
^^^^^^^
.. toctree::
    :maxdepth: 1
    :glob:

    reports/*/*/index

Reports are calculated from the :download:`Raw Rally data <raw/raw_data.tar.xz>`.


@ -0,0 +1,296 @@
Keystone authentication with kill of Keystone on one node
=========================================================
This report is generated from results collected by executing the following
Rally scenario:

.. code-block:: yaml

    ---
    {% set repeat = repeat|default(5) %}
    Authenticate.keystone:
    {% for iteration in range(repeat) %}
      -
        runner:
          type: "constant_for_duration"
          duration: 30
          concurrency: 20
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: kill keystone service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [100]
    {% endfor %}

Summary
-------
In the Fuel architecture Keystone is deployed behind Apache2, which in turn is
behind an NGINX front end. In this scenario we kill the Keystone processes
running on one of the controller nodes.

+-----------------------+------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+============+=======================================+===========================================+
| 0.038 ±0.081 | 2.28 ±0.23 | 1.21 ±0.35 | 9.1 ±2.3 |
+-----------------------+------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
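
The `Service downtime` definition above can be illustrated with a short
sketch. It assumes each sample is a hypothetical ``(timestamp, succeeded)``
pair ordered by time and that consecutive failures form one downtime period;
how the report derives the ± uncertainty is not stated here, so it is not
reproduced.

.. code-block:: python

    def downtime_periods(samples):
        """Return the duration of every run of consecutive failed samples.

        Each duration is the interval between the first and the last error
        of the run, so a single isolated error yields a duration of 0.
        """
        periods = []
        first = last = None
        for timestamp, succeeded in samples:
            if not succeeded:
                if first is None:
                    first = timestamp
                last = timestamp
            elif first is not None:
                periods.append(last - first)
                first = last = None
        if first is not None:
            # The series ended while the service was still failing.
            periods.append(last - first)
        return periods


    # Made-up samples: three consecutive errors, then one isolated error.
    samples = [
        (0.0, True), (0.5, False), (0.9, False), (1.4, False),
        (2.0, True), (2.5, False), (3.0, True),
    ]
    durations = downtime_periods(samples)
    print(durations)
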
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 78 | 0.12 | 0.13 | 0.041 | 0.23 |
+-----------+-------------+-----------+-----------+---------------------+
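
The columns of the baseline table can be computed from the raw operation
durations along the following lines. This is a sketch under assumptions: the
function name is hypothetical, the sample standard deviation is used, and the
95th percentile uses the nearest-rank method, while the report may use a
different estimator or interpolation.

.. code-block:: python

    import math
    import statistics


    def baseline_summary(durations):
        """Summarize baseline operation durations (seconds)."""
        ordered = sorted(durations)
        # Nearest-rank percentile: smallest value with at least 95% of the
        # samples at or below it.
        rank = max(1, math.ceil(0.95 * len(ordered)))
        return {
            "samples": len(ordered),
            "median": statistics.median(ordered),
            "mean": statistics.mean(ordered),
            "std_dev": statistics.stdev(ordered),
            "p95": ordered[rank - 1],
        }


    # Made-up durations in seconds, for illustration only.
    summary = baseline_summary(
        [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.14, 0.12, 0.25, 0.12]
    )
    print(summary)
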
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+----------------+
| # | Downtime, s |
+=====+================+
| 1 | 0.0034 ±0.0034 |
+-----+----------------+
| 2 | 0.0282 ±0.0014 |
+-----+----------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 2.711 ±0.023 | 1.30 ±0.39 | 10.8 ±3.0 |
+-----+----------------------+---------------------------+------------------------+
Run #2
^^^^^^
.. image:: plot_2.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 70 | 0.14 | 0.15 | 0.048 | 0.24 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+----------------+
| # | Downtime, s |
+=====+================+
| 1 | 0.0047 ±0.0047 |
+-----+----------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 2.722 ±0.026 | 1.66 ±0.43 | 11.9 ±2.9 |
+-----+----------------------+---------------------------+------------------------+
Run #3
^^^^^^
.. image:: plot_3.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.15 | 0.16 | 0.058 | 0.27 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+----------------+
| # | Downtime, s |
+=====+================+
| 1 | 0.1147 ±0.0067 |
+-----+----------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 2.317 ±0.019 | 1.07 ±0.35 | 7.5 ±2.1 |
+-----+----------------------+---------------------------+------------------------+
Run #4
^^^^^^
.. image:: plot_4.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 87 | 0.14 | 0.16 | 0.051 | 0.25 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+----------------+
| # | Downtime, s |
+=====+================+
| 1 | 0.0057 ±0.0057 |
+-----+----------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 1.695 ±0.015 | 1.11 ±0.29 | 8.0 ±1.8 |
+-----+----------------------+---------------------------+------------------------+
Run #5
^^^^^^
.. image:: plot_5.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 87 | 0.14 | 0.15 | 0.051 | 0.26 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+----------------+
| # | Downtime, s |
+=====+================+
| 1 | 0.0166 ±0.0044 |
+-----+----------------+
| 2 | 0.0162 ±0.0044 |
+-----+----------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 1.976 ±0.015 | 0.93 ±0.29 | 7.1 ±1.9 |
+-----+----------------------+---------------------------+------------------------+
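
The central values in the summary table of this report are consistent with
plain averages of the per-run values listed above: MTTR matches the mean of
the five `Time to recover` values, and `Service downtime` matches the mean of
the per-run totals of the downtime periods. The sketch below only checks that
arithmetic; it is not the reporting tool's code, and the way the ± intervals
are combined is not reproduced.

.. code-block:: python

    from statistics import mean

    # Per-run values copied from the tables in this report (seconds).
    time_to_recover = [2.711, 2.722, 2.317, 1.695, 1.976]
    downtime_per_run = [
        [0.0034, 0.0282],  # run 1
        [0.0047],          # run 2
        [0.1147],          # run 3
        [0.0057],          # run 4
        [0.0166, 0.0162],  # run 5
    ]

    mttr = mean(time_to_recover)
    # Total downtime inside each run, then the mean across runs.
    service_downtime = mean([sum(periods) for periods in downtime_per_run])

    print(round(mttr, 2))              # 2.28
    print(round(service_downtime, 3))  # 0.038
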


@ -0,0 +1,98 @@
Keystone authentication with kill of MySQL on one node
======================================================
This report is generated from results collected by executing the following
Rally scenario:

.. code-block:: yaml

    ---
    Authenticate.keystone:
      -
        runner:
          type: "constant_for_duration"
          duration: 60
          concurrency: 5
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: kill mysql service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [150]

Summary
-------
In this scenario we kill one of the MySQL servers while working with the
Keystone API. In the Fuel architecture MySQL is deployed with Galera in
active-active mode; however, Keystone loses its connection to the DB with the
following trace::

    (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL
    server at 'reading initial communication packet', system error: 0")

+-----------------------+-----------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+===========+=======================================+===========================================+
| 14.7 ±1.4 | N/A | N/A | N/A |
+-----------------------+-----------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 135 | 0.071 | 0.074 | 0.012 | 0.09 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 14.7 ±2.0 |
+-----+---------------+


@ -0,0 +1,292 @@
Keystone authentication with Keystone API restart on one node
=============================================================
This report is generated from results collected by executing the following
Rally scenario:

.. code-block:: yaml

    ---
    {% set repeat = repeat|default(5) %}
    Authenticate.keystone:
    {% for iteration in range(repeat) %}
      -
        runner:
          type: "constant_for_duration"
          duration: 30
          concurrency: 5
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: restart keystone service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [100]
    {% endfor %}

Summary
-------
In the Fuel architecture Keystone is deployed behind Apache2, which in turn is
behind an NGINX front end. In this scenario we restart the Apache2 service; as
a result, Keystone becomes unavailable on one of the controller nodes.

+-----------------------+------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+============+=======================================+===========================================+
| 1.07 ±0.76 | 5.44 ±0.47 | 0.41 ±0.22 | 4.7 ±2.0 |
+-----------------------+------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.071 | 0.077 | 0.017 | 0.13 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 0.88 ±0.75 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 3.549 ±0.034 | 0.51 ±0.25 | 7.6 ±3.3 |
+-----+----------------------+---------------------------+------------------------+
Run #2
^^^^^^
.. image:: plot_2.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.13 | 0.13 | 0.0086 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 1.00 ±0.87 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 6.038 ±0.034 | 0.35 ±0.17 | 3.7 ±1.3 |
+-----+----------------------+---------------------------+------------------------+
Run #3
^^^^^^
.. image:: plot_3.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.13 | 0.12 | 0.0077 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 0.26 ±0.12 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 6.123 ±0.037 | 0.43 ±0.25 | 4.4 ±2.0 |
+-----+----------------------+---------------------------+------------------------+
Run #4
^^^^^^
.. image:: plot_4.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.13 | 0.13 | 0.0089 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 1.02 ±0.73 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 5.860 ±0.027 | 0.25 ±0.13 | 2.9 ±1.1 |
+-----+----------------------+---------------------------+------------------------+
Run #5
^^^^^^
.. image:: plot_5.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 87 | 0.13 | 0.13 | 0.019 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 2.173 ±0.067 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 5.630 ±0.048 | 0.52 ±0.30 | 5.0 ±2.3 |
+-----+----------------------+---------------------------+------------------------+


@ -0,0 +1,201 @@
Keystone authentication with Memcached restart on one node
==========================================================
This report is generated from results collected by executing the following
Rally scenario:

.. code-block:: yaml

    ---
    {% set repeat = repeat|default(5) %}
    Authenticate.keystone:
    {% for iteration in range(repeat) %}
      -
        runner:
          type: "constant_for_duration"
          duration: 30
          concurrency: 5
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: restart memcached service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [100]
    {% endfor %}

Summary
-------
In this scenario we restart the Memcached service on one of the controller
nodes. Memcached is used as the caching backend for Keystone, so Keystone
performance is expected to degrade while the cache is unavailable.

+-----------------------+--------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+==============+=======================================+===========================================+
| N/A | 0.458 ±0.068 | 0.057 ±0.034 | 1.46 ±0.27 |
+-----------------------+--------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 88 | 0.12 | 0.12 | 0.014 | 0.13 |
+-----------+-------------+-----------+-----------+---------------------+
Run #2
^^^^^^
.. image:: plot_2.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.12 | 0.12 | 0.0078 | 0.13 |
+-----------+-------------+-----------+-----------+---------------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 0.4059 ±0.0027 | 0.069 ±0.030 | 1.57 ±0.25 |
+-----+----------------------+---------------------------+------------------------+
Run #3
^^^^^^
.. image:: plot_3.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 88 | 0.12 | 0.13 | 0.017 | 0.15 |
+-----------+-------------+-----------+-----------+---------------------+
Run #4
^^^^^^
.. image:: plot_4.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.12 | 0.12 | 0.01 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Run #5
^^^^^^
.. image:: plot_5.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 84 | 0.13 | 0.13 | 0.0086 | 0.14 |
+-----------+-------------+-----------+-----------+---------------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 0.5110 ±0.0037 | 0.045 ±0.037 | 1.35 ±0.29 |
+-----+----------------------+---------------------------+------------------------+
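
In this report only runs #2 and #5 list a degradation period, and the summary
MTTR of 0.458 s is consistent with averaging just those two runs. A minimal
sketch of that handling follows; representing a run without a detected period
as ``None`` and the name ``summarize`` are assumptions made for illustration,
not the reporting tool's implementation.

.. code-block:: python

    from statistics import mean

    # Time to recover per run (seconds); None marks a run where no
    # degradation period is listed in this report.
    time_to_recover = [None, 0.4059, None, None, 0.5110]


    def summarize(values):
        """Mean of the values that are present, or None (N/A) if none are."""
        present = [v for v in values if v is not None]
        return mean(present) if present else None


    mttr = summarize(time_to_recover)
    print(round(mttr, 3))  # 0.458
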


@ -0,0 +1,160 @@
Create and list networks with kill of one of MySQL servers
==========================================================
This report is generated from results collected by executing the following
Rally scenario:
.. code-block:: yaml

    ---
    {% set repeat = repeat|default(3) %}
    NeutronNetworks.create_and_list_networks:
    {% for iteration in range(repeat) %}
      -
        args:
          network_create_args: {}
        runner:
          type: "constant_for_duration"
          duration: 60
          concurrency: 4
        context:
          users:
            tenants: 1
            users_per_tenant: 1
          quotas:
            neutron:
              network: -1
        hooks:
          -
            name: fault_injection
            args:
              action: kill mysql service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [100]
    {% endfor %}

Summary
-------
In this scenario we kill one of the MySQL servers while working with the
Neutron API. In the Fuel architecture MySQL is deployed with Galera in
active-active mode, so no dramatic impact is expected.
+-----------------------+------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+============+=======================================+===========================================+
| N/A | 7.73 ±0.72 | 1.4 ±1.1 | 3.8 ±2.3 |
+-----------------------+------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
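
As an illustration of the last two metrics, the following minimal sketch
computes both values from two lists of operation durations. The function name
``degradation`` and the sample values are hypothetical and are not taken from
the tool that produced this report. The sketch also assumes that "absolute"
means a difference expressed in seconds rather than a mathematical absolute
value.

.. code-block:: python

    from statistics import mean


    def degradation(baseline, recovery):
        """Return (absolute, relative) degradation of operation duration.

        baseline: durations in seconds sampled before the fault injection
        recovery: durations in seconds sampled during the recovery period
        """
        base_mean = mean(baseline)
        rec_mean = mean(recovery)
        # Absolute: difference of the means, in seconds.
        # Relative: ratio of the means, dimensionless.
        return rec_mean - base_mean, rec_mean / base_mean


    # Hypothetical durations, not taken from the report.
    baseline = [0.4, 0.5, 0.6]
    recovery = [1.5, 2.0, 2.5]
    absolute, relative = degradation(baseline, recovery)

With these made-up inputs the recovery-period mean is 2.0 s against a baseline
mean of 0.5 s, so the absolute degradation is about 1.5 s and the relative
degradation is about 4.
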
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 86 | 0.48 | 0.8 | 0.49 | 1.6 |
+-----------+-------------+-----------+-----------+---------------------+
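
The columns of a baseline table can be reproduced from raw operation durations
with the Python standard library. The sketch below is an assumption about how
such statistics might be computed, not the report generator's actual code:
the helper name ``baseline_stats`` and the input values are hypothetical, and
the report does not state whether it uses the sample or the population
standard deviation, or which percentile interpolation method.

.. code-block:: python

    from statistics import mean, median, quantiles, stdev


    def baseline_stats(durations):
        """Summarize baseline operation durations (seconds)."""
        return {
            "samples": len(durations),
            "median": median(durations),
            "mean": mean(durations),
            # Sample standard deviation (an assumption, see the text above).
            "std_dev": stdev(durations),
            # quantiles(n=100) returns 99 cut points; index 94 is the 95th.
            "p95": quantiles(durations, n=100, method="inclusive")[94],
        }


    # Hypothetical durations, not taken from the report.
    durations = [0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52]
    stats = baseline_stats(durations)
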
Run #2
^^^^^^
.. image:: plot_2.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 85 | 0.46 | 0.5 | 0.12 | 0.7 |
+-----------+-------------+-----------+-----------+---------------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 6.824 ±0.093 | 1.5 ±1.2 | 4.1 ±2.5 |
+-----+----------------------+---------------------------+------------------------+
Run #3
^^^^^^
.. image:: plot_3.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 85 | 0.45 | 0.47 | 0.065 | 0.61 |
+-----------+-------------+-----------+-----------+---------------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 8.63 ±0.12 | 1.18 ±1.00 | 3.5 ±2.1 |
+-----+----------------------+---------------------------+------------------------+
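
The values in the Summary table appear consistent with a plain average of the
per-run figures. The short check below uses only numbers copied from the
tables above. How the tool actually aggregates the runs and derives the
``±`` terms is not stated in this report, so the averaging is an assumption.

.. code-block:: python

    from statistics import mean

    # Per-run values copied from the "Service performance degradation"
    # tables. Run #1 reports no degradation period, so only runs #2 and #3
    # contribute.
    time_to_recover = [6.824, 8.63]  # seconds
    relative_degradation = [4.1, 3.5]  # ratio

    mttr = mean(time_to_recover)
    mean_relative = mean(relative_degradation)

    # Rounded the same way as in the Summary table.
    mttr_rounded = round(mttr, 2)
    relative_rounded = round(mean_relative, 1)

The rounded results, 7.73 s and 3.8, match the MTTR and the relative
performance degradation reported in the Summary.
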


@ -0,0 +1,119 @@
Boot and delete VM with disabling management network on one of controllers
==========================================================================
This report is generated from results collected by executing the following
Rally scenario:
.. code-block:: yaml

    ---
    NovaServers.boot_and_delete_server:
      -
        args:
          flavor:
            name: "m1.micro"
          image:
            name: "(^cirros.*uec$|TestVM)"
          force_delete: false
        runner:
          type: "constant_for_duration"
          duration: 600
          concurrency: 4
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: disconnect management network on one node with nova-scheduler service
            trigger:
              name: event
              args:
                unit: iteration
                at: [50]

Summary
-------
In this scenario we disable the management network interface on one of the
controllers (in the Fuel architecture a controller runs the DB, MQ, API
services and the scheduler). This emulates a networking outage (a network port
failure on the machine or on the switch).

The outage causes all services to become unreachable from outside. Moreover,
the cluster remains broken even 10 minutes after the fault.
+-----------------------+------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+============+=======================================+===========================================+
| 358.0 ±2.7 | 149.0 ±2.1 | 24 ±17 | 5.7 ±3.4 |
+-----------------------+------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 36 | 5.5 | 5.2 | 0.6 | 6 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 126.32 ±0.82 |
+-----+---------------+
| 2 | 231.7 ±6.5 |
+-----+---------------+
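
Downtime periods like the ones listed above can be derived by grouping
consecutive failed operations and measuring the interval between the first
and the last error of each group. The sketch below illustrates that idea
under stated assumptions: the function name ``downtime_periods``, the input
format and the sample data are hypothetical and do not come from the report
generator.

.. code-block:: python

    def downtime_periods(samples):
        """Group consecutive failed samples into (start, end) periods.

        samples: list of (timestamp_s, ok) tuples ordered by timestamp.
        """
        periods = []
        start = last = None
        for timestamp, ok in samples:
            if not ok:
                if start is None:
                    start = timestamp  # first error of a new period
                last = timestamp  # most recent error so far
            elif start is not None:
                periods.append((start, last))
                start = last = None
        if start is not None:
            # The run ended while the service was still failing.
            periods.append((start, last))
        return periods


    # Hypothetical samples: (timestamp in seconds, operation succeeded).
    samples = [(0, True), (5, False), (10, False), (15, True),
               (20, False), (30, False), (35, True)]
    periods = downtime_periods(samples)
    durations = [end - start for start, end in periods]
    total = sum(durations)

In the table above, the two periods add up to 126.32 + 231.7 = 358.02 s,
which agrees with the service downtime of 358.0 s given in the Summary.
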
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 149.0 ±4.6 | 24 ±17 | 5.7 ±3.4 |
+-----+----------------------+---------------------------+------------------------+


@ -0,0 +1,81 @@
Boot and delete VM with kill of RabbitMQ on one of nodes
========================================================
This report is generated from results collected by executing the following
Rally scenario:
.. code-block:: yaml

    ---
    NovaServers.boot_and_delete_server:
      -
        args:
          flavor:
            name: "m1.micro"
          image:
            name: "(^cirros.*uec$|TestVM)"
          force_delete: false
        runner:
          type: "constant_for_duration"
          duration: 240
          concurrency: 4
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: kill rabbitmq service on one node
            trigger:
              name: event
              args:
                unit: iteration
                at: [60]

Summary
-------
In this scenario we kill one of the running RabbitMQ servers. Once killed,
RabbitMQ is restarted automatically by Pacemaker.

The cloud stays stable: no errors and no significant performance degradation
are observed. The oslo.messaging library handles the loss of connection to
RabbitMQ and reconnects to one of the other servers automatically::

    AMQP server on 10.43.0.3:5673 is unreachable: timed out. Trying again in
    1 seconds.
    ...
    Reconnected to AMQP server on 10.43.0.6:5673 via [amqp] client

Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 45 | 5.8 | 5.8 | 0.3 | 6.1 |
+-----------+-------------+-----------+-----------+---------------------+


@ -0,0 +1,113 @@
Boot and delete VM with reboot of one of controllers
====================================================
This report is generated from results collected by executing the following
Rally scenario:
.. code-block:: yaml

    ---
    NovaServers.boot_and_delete_server:
      -
        args:
          flavor:
            name: "m1.micro"
          image:
            name: "(^cirros.*uec$|TestVM)"
          force_delete: false
        runner:
          type: "constant_for_duration"
          duration: 600
          concurrency: 4
        context:
          users:
            tenants: 1
            users_per_tenant: 1
        hooks:
          -
            name: fault_injection
            args:
              action: reboot one node with rabbitmq service
            trigger:
              name: event
              args:
                unit: iteration
                at: [50]

Summary
-------
In this scenario we reboot one of the controllers (in the Fuel architecture a
controller runs the DB, MQ, API services and the scheduler). The observed
recovery period corresponds to the time needed for the node to reboot, start
its services and return to a synchronized state.
+-----------------------+--------------+---------------------------------------+-------------------------------------------+
| Service downtime, s | MTTR, s | Absolute performance degradation, s | Relative performance degradation, ratio |
+=======================+==============+=======================================+===========================================+
| 8.7 ±1.6 | 286.89 ±0.87 | 14.7 ±4.7 | 3.85 ±0.91 |
+-----------------------+--------------+---------------------------------------+-------------------------------------------+
Metrics:
* `Service downtime` is the time interval between the first and
the last errors.
* `MTTR` is the mean time to recover service performance after
the fault.
* `Absolute performance degradation` is an absolute difference between
the mean of operation duration during recovery period and the baseline's.
* `Relative performance degradation` is the ratio between the mean
of operation duration during recovery period and the baseline's.
Details
-------
This section contains individual data for particular scenario runs.
Run #1
^^^^^^
.. image:: plot_1.svg
Baseline
~~~~~~~~
Baseline samples are collected before the start of fault injection. They are
used to estimate service performance degradation after the fault.
+-----------+-------------+-----------+-----------+---------------------+
| Samples | Median, s | Mean, s | Std dev | 95% percentile, s |
+===========+=============+===========+===========+=====================+
| 36 | 5.1 | 5.2 | 0.63 | 6.1 |
+-----------+-------------+-----------+-----------+---------------------+
Service downtime
~~~~~~~~~~~~~~~~
The tested service is not available during the following time period(s).
+-----+---------------+
| # | Downtime, s |
+=====+===============+
| 1 | 8.7 ±2.5 |
+-----+---------------+
Service performance degradation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tested service has measurable performance degradation during the
following time period(s).
+-----+----------------------+---------------------------+------------------------+
| # | Time to recover, s | Absolute degradation, s | Relative degradation |
+=====+======================+===========================+========================+
| 1 | 286.89 ±0.76 | 14.7 ±4.7 | 3.85 ±0.91 |
+-----+----------------------+---------------------------+------------------------+
