Merge "Add Eris whitepaper from Gautam"
commit 8eb9196116

doc/source/eris/index.rst | 722 lines (new file)
@@ -0,0 +1,722 @@
===============================================
OpenStack Eris - an extreme testing framework
===============================================

.. contents::
   :depth: 2
   :local:


Introduction
============

OpenStack has been expanding at a breakneck pace. Its adoption has
been phenomenal, and it is currently the go-to choice for on-premises
cloud IaaS software. From a software development perspective,
OpenStack today has approximately *nLines* lines of code contributed
by thousands of developers, reviewers and PTLs. There are
*mNewProjects* new projects each year and *kBlueprints* blueprints
under review. From an adoption perspective, OpenStack clouds today
power *nCPUs* processor cores in *nCompanys* companies. These
installations handle a variety of traffic, from simple web hosting to
extremely resource- and SLA-intensive workloads such as telecom
virtual network functions (VNFs) and scientific computing.

A commonly heard theme with regard to this rapid expansion, in both
the installed footprint and the OpenStack software project, is
resiliency and performance. More specifically, the questions asked
are:

- What are the resiliency and performance characteristics of OpenStack
  from a control and data plane perspective?

- What sort of performance metrics can be achieved with a specific
  architecture?

- How resilient is the architecture to failures?

- How much resource scale can be achieved?

- What level of concurrency can resource operations handle?

- How operationally ready is a particular OpenStack installation?

- How do new releases compare to the older ones with regards to the
  above questions?

OpenStack Eris is an extreme testing framework and test suite that
proposes to stress OpenStack in various ways to answer these
performance and resiliency questions. Eris comes out of `the LCOO
working group <https://wiki.openstack.org/wiki/LCOO>`_'s efforts to
derive holistic performance, reliability and availability
characteristics for OpenStack installations at the release/QA
gates. In addition, Eris aims to provide capabilities for third-party
CIs and other open source communities, such as OpenContrail, to
execute and publish similar characteristics.


Goals and Benefits
==================

The major objective of the project was outlined in the previous
section: derive holistic performance, reliability and availability
characteristics for OpenStack. Figure 1 below breaks this objective
down into the specific goals needed to achieve it. The aim of this
section is to discuss these goals in fairly abstract terms without
diving into actual implementation details.

+-----------------------------+
| |image0|                    |
+=============================+
| **Figure 1: Goals of Eris** |
+-----------------------------+

Eris has three major goals that flow from its primary objective of
deriving holistic performance, reliability and availability
characteristics of OpenStack. Each major goal and its sub-goals are
discussed in detail below.

Goal 1: Requirements
--------------------

Define a reference infrastructure architecture, realistic workloads
for that architecture, and reference KPIs/SLOs valid for that
architecture.

- **Reference architecture(s):** Performance and resiliency
  characteristics of a system are only valid for the specific
  architecture it is configured for. Hence, one of our first goals is
  to define reference architectures on which tests will be run.

- **Reference workload(s):** When pursuing the assessment of
  performance and resiliency we should ensure that it is done under
  well-defined workloads. These workloads should be modeled on either
  normal or stressful situations that happen in real data
  centers. Unrealistic workloads skew results and provide data that is
  not useful.

- **Reference KPI/SLO(s):** The type of testing that Eris proposes is
  non-deterministic, i.e. performance or resiliency cannot be
  determined by the success or failure of a single transaction.
  Performance and resiliency are generally determined by using
  aggregates of certain metrics (e.g. percent success rate, mean
  transaction response times, mean time to recover, etc.) for a set of
  transactions run over an extended time period. These aggregate
  metrics are the Key Performance Indicators (KPIs) or Service Level
  Objectives (SLOs) of the test. These metrics need to be defined
  since they will determine the pass/fail criteria for the testing. A
  minimal sketch of such aggregate KPI computation follows this list.
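
To make the notion of aggregate KPIs concrete, below is a minimal
sketch of how a percent success rate, a mean response time and a
percentile could be computed from a set of transaction records and
compared against SLO thresholds. The record format and threshold
values are illustrative assumptions, not part of any Eris
specification.

.. code-block:: python

    # Minimal sketch: aggregate KPIs from transaction records. The
    # record format and SLO thresholds are illustrative assumptions.
    import statistics

    # Each record: (succeeded, response_time_seconds)
    records = [(True, 0.42), (True, 0.51), (False, 30.0), (True, 0.47)]

    def compute_kpis(records):
        times = [t for ok, t in records if ok]
        return {
            "percent_success": 100.0 * sum(ok for ok, _ in records) / len(records),
            "mean_response_time": statistics.mean(times),
            "p95_response_time": sorted(times)[int(0.95 * (len(times) - 1))],
        }

    # Hypothetical SLO: at least 99% success, mean response under 2s.
    slo = {"percent_success": 99.0, "mean_response_time": 2.0}

    kpis = compute_kpis(records)
    passed = (kpis["percent_success"] >= slo["percent_success"]
              and kpis["mean_response_time"] <= slo["mean_response_time"])
    print(kpis, "PASS" if passed else "FAIL")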

Goal 2: Frameworks
------------------

Define the elements of an extreme testing framework that encompasses
the ability to create repeatable experiments, test creation, test
orchestration, extensibility, automation and capabilities for
simulation and emulation. The Eris framework is not tightly coupled to
the test suite or the requirements, which leaves it flexible for other
general-purpose uses such as VNF testing.

- **Repeatable experiments:** For non-deterministic testing, the
  ability to create repeatable experiments is paramount. Such a
  capability allows parameters to be consistently verified within the
  KPI/SLO limits.

- **Test Creation:** Ease of test creation is a basic facility that
  should be provided by the framework. A test should be specified
  using an open specification and require minimal development
  (programming). It should maximize the capability for re-use between
  already developed components and test cases. An illustrative sketch
  of such a specification follows this list.

- **Test Orchestration:** Facilities for test orchestration should be
  provided by the framework. Test orchestration can span various
  layers of the reference architecture. The test orchestration
  mechanism should be able to orchestrate the reference workloads and
  failures on the reference architecture and measure the reference
  KPI/SLO.

- **Extensibility:** The framework should be extensible at all layers.
  This means the framework should be designed using a plugin/driver
  model with a sufficiently flexible specification to accomplish this
  goal.

- **Automation:** The entire test suite should be automated. This
  includes orchestrating the various steps of the test along with
  computing the success/failure of the test based on the KPI/SLO
  supplied. This also explicitly means that good mathematics will be
  needed: no one should be eyeballing graphs to see whether KPIs are
  met.

- **Simulation and Emulation:** Any framework for performance and
  resiliency testing needs to have efficient and effective simulation
  and emulation mechanisms. These are especially useful for running
  experiments in constrained environments. For example, how would we
  know whether OpenStack control plane components are ready for a
  5000-compute-node scale? It is not possible to acquire that kind of
  hardware, so testing will eventually need robust simulation and
  emulation components.
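
As a thought experiment on what such an open specification could look
like, here is a minimal sketch expressed as plain Python data. Every
field name (``architecture``, ``workload``, ``faults``, ``slo``) is a
hypothetical illustration, not a defined Eris schema.

.. code-block:: python

    # Hypothetical open test specification: plain data that an
    # orchestration engine could translate into concrete steps.
    # All field names are illustrative, not an Eris-defined schema.
    test_spec = {
        "name": "boot-storm-with-compute-failure",
        "architecture": "ref-arch-small",   # reference architecture id
        "workload": [
            {"plugin": "rally",
             "scenario": "NovaServers.boot_and_delete_server",
             "concurrency": 20, "duration_s": 3600},
            {"plugin": "shaker", "scenario": "l3_east_west"},
        ],
        "faults": [
            {"plugin": "os-faults", "action": "restart",
             "target": "nova-compute", "trigger": {"type": "random"}},
        ],
        "slo": {"percent_success": 99.0, "mean_response_time_s": 2.0},
    }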

Goal 3: Test Suite
------------------

The test suite is the actual set of tests that are run by the
framework on the reference architecture with the reference workload
and faults specified. The end result is to derive the metrics related
to performance, reliability and availability.

- **Control Plane Performance:** This test suite will be responsible
  for running the reference API workload on various OpenStack
  components.

- **Data Plane Performance:** This test suite will be responsible for
  running the reference data plane workload. The expectation is that
  data and control plane performance workloads are run together to get
  a feel for realistic traffic in an installed OpenStack environment.

- **Resiliency to Failure:** The test suites will, at either random or
  predetermined points, inject failures into the system at various
  levels (hardware, network, etc.). The failure types could be simple
  or compound failures. The KPIs published will also include details
  on how OpenStack reacts to and recovers from these failures.

- **Resource scale limits:** This test suite will seek to identify
  limits of resource scale. Examples are: how many VMs can be created,
  how many networks, how many cinder volumes, how many volumes per VM,
  etc.? The test suite will also track the performance of various
  components as the resources are scaled. There isn't an expectation
  of high concurrency for these tests; the primary goal is to flush
  out the various “limits” that exist but are not explicitly specified
  either by OpenStack or the components it uses.

- **Resource concurrency limits:** This test suite will seek to
  identify limits of resource concurrency. Examples are: how many
  concurrent modifications can be made on a network, a subnet, a port,
  etc. As with resource scale limits, resources will need to be
  identified and concurrent transactions will need to be run against
  single resources. The test suite will track the performance of
  various components during the test.

- **Operational readiness:** It is often not feasible to run the
  entire gamut of long-running tests identified above. What is needed,
  either for production readiness testing or for QA gates, is a smoke
  test that signifies operational readiness: the minimal criteria
  needed to declare a code change good or a site healthy. The test
  suite will contain a “smoke test” for performance, reliability and
  availability labelled as its operational readiness test.

Review of Existing Projects
===========================

A lot of work has gone into disparate projects, some successful and
some not that well known, to build tools and test suites for measuring
OpenStack performance, reliability and availability. This section
reviews these projects against our goals and provides an analysis of
the tools we intend to use.

Summary of Projects
-------------------

OpenStack/Rally
~~~~~~~~~~~~~~~

`Rally <https://docs.openstack.org/developer/rally/>`_ is currently
the tool of choice for control plane performance testing. It has a
flexible architecture with a plugin mechanism that can be extended. It
has a wide base of existing plugins for OpenStack scenarios, and this
base keeps expanding. Most performance testing of OpenStack today uses
Rally. The benchmarks it provides today mostly concern transaction
success rates and response times, as Rally is only aware of what is
happening on the client side of the transaction. There is scope for
failure injection scenarios using an os-faults hook with triggers.

OpenStack/Shaker
~~~~~~~~~~~~~~~~

`Shaker <https://opendev.org/performa/shaker>`_ is currently the
popular choice for data plane network performance testing. It has a
custom-built image with agents and iperf/iperf3 toolsets, along with a
wide array of heat templates to instantiate a topology. Shaker also
provides various methods to measure metrics and enforce the SLA of the
tests.

OpenStack/os-faults
~~~~~~~~~~~~~~~~~~~

The failure injection mechanism used within Rally, and one that can
also be used independently, is `os-faults
<https://opendev.org/performa/os-faults>`_. It consists of a CLI and a
library. It currently contains failure injections that can be run at
either a hardware or a software level. Software failure injections are
network and process failures, while hardware faults are delivered via
IPMI to servers. Information about a site can be discovered via
pre-defined drivers (fuel, tcpcloud, etc.) or provided directly via a
JSON configuration file. The set of drivers can be extended by
developers for more automated discovery mechanisms.
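
For a flavor of the library side, here is a minimal sketch of driving
os-faults from Python, modeled on its documented usage; the cloud
configuration is a placeholder, and the calls should be verified
against the os-faults release in use.

.. code-block:: python

    # Minimal sketch of os-faults library usage. The cloud configuration
    # is a placeholder assumption for a devstack-style deployment.
    import os_faults

    cloud_config = {
        "cloud_management": {
            "driver": "devstack",
            "args": {"address": "192.0.2.10", "username": "stack"},
        }
    }

    cloud = os_faults.connect(cloud_config)
    cloud.verify()                                # check connectivity first
    service = cloud.get_service(name="keystone")  # software-level target
    service.restart()                             # inject a process restart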

Cisco/cloud99
~~~~~~~~~~~~~

`Cloud99 <https://github.com/cisco-oss-eng/Cloud99>`_ is a Cisco open
source project for probing high availability deployments of
OpenStack. It consists primarily of software that runs load on the
control and data plane, injects service disruptions and measures
metrics. The load runner for the control plane is a wrapper around
OpenStack Rally. There doesn't seem to be a data plane load runner
implemented at this point in time. Metrics gathering is via
Ansible/SSH, and the service disruptors use Paramiko/SSH to induce
disruptions.

Other Efforts
~~~~~~~~~~~~~

There have been several other efforts that use some combination of the
tools mentioned above with custom frameworks to achieve, in part, some
of the objectives that have been set for Eris. Notable work includes:

- an Intel destructive scenario report using Rally and os-faults,

- `the Mirantis Stepler framework
  <https://github.com/Mirantis/stepler>`_ that uses os-faults for
  failure injection, and

- `the OSIC ops-workload-framework
  <https://github.com/osic/ops-workload-framework>`_.

Most of this work focuses on control plane performance combined with
failure injection.

`The ENOS framework <https://github.com/BeyondTheClouds/enos>`_
combines Rally with a deployment of containerized OpenStack to
generate repeatable performance experiments.

Gap Analysis
------------

This section provides a gap analysis of the above tools with regard to
the goals of Eris. The purpose here is not to rule out or exclude the
tools from use in Eris. To the contrary, it is to identify the
strengths of the existing toolset and investigate where Eris needs to
focus its efforts.

Requirements Gaps
~~~~~~~~~~~~~~~~~

One of the major gaps identified above is the focus on frameworks at
the cost of reference requirements. For any non-deterministic testing
mechanism that focuses on performance, reliability and availability,
the underlying architecture, workloads and SLOs are extremely
important. Those are the references that give the numbers meaning. It
is not that the frameworks are secondary, but in the absence of the
reference requirements, numbers from frameworks and test suites are
hard to interpret and use. There are also specific gaps in the
frameworks and test suites, outlined below.

Framework Gaps
~~~~~~~~~~~~~~

**Repeatable Experiments:** *ENOS* is the only tool that is geared
towards generating repeatable performance experiments. However, it is
only valid for container deployments. There are various other
deployment tools like Fuel, Ansible, etc., but none that integrate
deployment with the various test suites.

**Test Creation:** Rally is the de facto standard for control plane
performance test specification. Most tools and efforts around
performance and failure injection of OpenStack have leveraged Rally,
including Cloud99 and ENOS. Shaker is popular for network load
generation and provides a fairly good suite of out-of-the-box
templates for creating and benchmarking various types of tenant
network load. Although both tools are extensible, there are major gaps
with regard to specifying combined control and data plane workloads,
like a real IaaS would have. The gaps include scenarios like I/O
loads, network BGP loads, DPDK, CPU and memory in the data plane. They
include multi-scenario and distributed workload generation in
Rally. For failure injection specifications, Shaker supports no
failure injections, while Rally supports single failure injections via
the os-faults library with deterministic triggers (at specific
iteration points or times).

**Test Orchestration:** There are no tools today that support
distributed test orchestration. None of the tools analyzed above have
the ability to deploy a test suite to multiple
nodes/locations/containers, etc. and orchestrate and manage a test.
Further, integrating such a capability into these tools would involve
some major re-architecture and refactoring [addRef-RallyRoadmap]. The
test orchestration SLA specifications today are fairly disparate for
the control and data plane, and they lack a uniform mechanism to add
new counters and metrics, especially from control plane or compute
hosts. Ansible seems to be used primarily as a crutch for SSH while
ignoring the many capabilities of Ansible that could actually close
these gaps.

**Extensibility:** Most tools surveyed are extensible for the simpler
changes, i.e. more failure injection scenarios, randomized triggers,
new API call scenarios, etc. However, the bigger changes seem to need
some fairly extensive work. Examples include various items in the
Rally roadmap that are blocked by a major refactoring effort. Shaker
today doesn't have a failure injection mechanism plugged in, in
addition to not having other data plane load generation
tools/capabilities. The tools definitely do not support plugins to
interface with other third-party (or proprietary) tools, and they make
the integration of different performance collection and computation
counters difficult.

**Automation:** While a fair amount of thought is paid today to test
setup and test orchestration automation, there is not a lot of work on
automating the success and failure criteria based on defined SLOs.
Rally and Shaker both incorporate specific SLA verification mechanisms
but both are limited: Shaker by what is observed on the guest VMs, and
Rally by the API response times and success rates. The overall health
of an IaaS installation will require many more counters, with more
complex mathematics needed to calculate metrics and verify the
system's capability to satisfy SLOs.

**Simulation and Emulation:** No major extreme testing framework is
complete without competent simulators and emulators. There needs to be
the capability to test scale without actually having the scale, which
is especially important for an IaaS system. As an example, take the
case of scaling an OpenStack cloud to 5000 compute nodes. Is it
possible? Probably not. However, requesting 5000 actual computes to
test the software changes that would make it possible is
unrealistic. This is a major gap today in OpenStack, with no
mechanisms to test scale or resiliency without having “real” data
centers. The only thing that comes close is the RabbitMQ simulator in
OpenStack/oslo.

Test Suite Gaps
~~~~~~~~~~~~~~~

**Control & Data Plane Performance:** Rally contains single scenarios
for performance testing that sample loads. Shaker contains various
heat templates for sample configurations. Neither can be classified as
a test suite where OpenStack runs and publishes performance-related
numbers. Again, the limitation of not having multi-scenario and
distributed workloads will come into play as performance numbers need
to be run for larger clouds. In such situations, workloads where only
a single machine/client is running the orchestration may not be
viable.

**Resiliency to Failure:** There are currently no test suites that
measure resiliency to failure. While an os-faults plugin exists in
Rally, the library itself is out of maintenance today. There are no
scenarios of failures to the data plane. There has been an effort to
identify points of failure and types of failure along with executing
failure scenarios [AddRef-Intelos-faults]. However, these scenarios
are run with single Rally workloads, and the assertion that this
represents real traffic seems unrealistic.

**Resource Scale & Concurrency Limits:** There are currently no test
suites that probe these limits. They are generally uncovered when
unsuspecting (or over-enthusiastic) tenants try something completely
outside of what is “ordinary” and the operation fails. These incidents
typically end up as bug reports and are investigated and fixed. What
is needed is a proactive mechanism to probe and uncover these limits.

**Operational Readiness:** There is currently no step in the OpenStack
QA workflow that can take a reference architecture, reference workload
and reference KPIs and run a battery of smoke tests covering the test
suites mentioned in the points above. These smoke or “operational
readiness” tests are needed to ensure that fixes and changes to
components do not adversely impact their performance, reliability and
availability. This does go back to fixing the gaps that such a test
would need at the QA gates, but once those gaps are fixed such tests
should be a part of the workflow.

Eris Architecture
=================

Eris is architected to achieve the goals listed in Section 2. This
section specifies the basic components of Eris and the Eris QA
workflow. The idea is to get Eris down to an abstract framework that
can then be extended and implemented using a variety of tools. The QA
workflow will identify at what points Eris should run.

Eris Framework
--------------

+------------------------------+
| |image1|                     |
+==============================+
| **Figure 2: Eris Framework** |
+------------------------------+

As depicted in Figure 2, the proposed Eris architecture is modular.
The dark blue boxes denote existing OpenStack systems that developers
and the community use. The CI/CD infrastructure will be responsible
for scheduling and invoking the testing. Tests that fail the SLA/KPI
criteria will have bugs created for them in the ticketing system, and
community developers can create either tests targeted to their
components or tests that are cross-component.

**Test Manager:** The responsibility of the test manager is to invoke
test suite orchestration, interface with the bug and ticketing
systems, and store logs and data for future reference. The underlying
orchestration layer and orchestration plugins all pipe data and logs
into the test manager.

**Orchestration:** The responsibility of the orchestration component
is to run a test scenario that can include deployment, discovery, load
injection, failure injection, monitoring, metrics collection and KPI
computation. The orchestration engine should be able to take an open
specification and turn it into concrete steps that execute the test
scenario. The orchestration engine itself may not be the tool that
runs all the scenarios.
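
One way to realize this separation between the engine and the tools it
drives is a thin step/driver interface; the sketch below is a
hypothetical illustration (the class and method names are assumptions,
not an Eris API).

.. code-block:: python

    # Hypothetical plugin/driver interface for the orchestration engine.
    # Class and method names are illustrative assumptions.
    import abc

    class ScenarioStep(abc.ABC):
        """One concrete step derived from the open test specification."""

        @abc.abstractmethod
        def setup(self, inventory):
            """Prepare the step against the discovered/deployed zone."""

        @abc.abstractmethod
        def run(self):
            """Execute the step and return raw counters and logs."""

        @abc.abstractmethod
        def teardown(self):
            """Clean up any resources the step created."""

    class Orchestrator:
        """Turns a specification into ordered steps and executes them."""

        def __init__(self, steps, inventory):
            self.steps, self.inventory = steps, inventory

        def execute(self):
            results = []
            for step in self.steps:
                step.setup(self.inventory)
                try:
                    results.append(step.run())
                finally:
                    step.teardown()
            return results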

**Zone Deployment:** The zone deployment plugin will take a reference
architecture specification and deploy an OpenStack installation that
complies with that reference architecture. It will also take the
various reference workload and metrics collection specifications and
deploy the test tools with the distribution specified. When the
orchestrator deploys an architecture based on a specification, it will
not need to discover the zone.

**Zone Discovery:** In the event that the orchestration plugin
operates on an existing deployment, it will need to discover the
various components of the reference architecture it is installed
on. This will be the responsibility of the zone discovery plugin. The
zone discovery plugin should also eventually be able to recognize a
reference architecture, although initially this capability may be
complex to incorporate.

**Control Plane Load Injection:** This plugin is responsible for
setting up and running the control plane load injection. The setup may
include a distributed multi-scenario load injection to mimic actual
load on an OpenStack IaaS installation, depending on the reference
workload. Running load should be flexible enough to tune the load
models across the various distributed nodes and to specify ramp-up,
ramp-down and steady-state models. This plugin will run the OpenStack
APIs against the control plane services and, depending on the
scenarios executed, may need admin access to the zone.

**Data Plane Load Injection:** This plugin is responsible for setting
up various data plane load injection scenarios and running them. As
with the control plane load injection, this can include a distributed
multi-scenario setup to mimic actual traffic, depending on the
reference workload. While in the control plane case the setup may
include something like creating a Rally deployment, in the data plane
case it means setting up tenant resources to run stress on the data
plane. Again, as with the control plane load injection, load will need
to be distributed across various nodes and be tunable with ramp-up,
ramp-down and steady-state models. Stress types should include storage
I/O, network, CPU and memory at a minimum.

**Failure Injection:** The failure injection plugin will be
responsible for injecting failures into various parts of the reference
architecture. The failures could be simple or compound failures. The
injection interval can be deterministic, i.e. based on a certain time
or workload iteration point, randomized, or event-driven, i.e. based
on certain events happening in the control or data plane. The nature
of the failure injection plugin demands that it have root access (or
sudo root) across every component in the reference architecture and
tenant space.
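
To illustrate the three trigger styles, here is a minimal sketch of a
trigger abstraction that a failure injection plugin could poll once
per workload iteration; all names and the polling scheme are
hypothetical illustrations, not an Eris design decision.

.. code-block:: python

    # Hypothetical trigger abstraction for deterministic, randomized and
    # event-driven failure injection. All names are illustrative.
    import random
    import time

    def deterministic_trigger(at_iteration):
        """Fire exactly at a given workload iteration point."""
        return lambda iteration, events: iteration == at_iteration

    def randomized_trigger(probability):
        """Fire with a fixed probability at each check."""
        return lambda iteration, events: random.random() < probability

    def event_trigger(event_name):
        """Fire once a named control/data plane event has been observed."""
        return lambda iteration, events: event_name in events

    def run_with_faults(workload_steps, inject_fault, trigger):
        """Poll the trigger once per workload iteration."""
        observed_events = set()
        for i, step in enumerate(workload_steps):
            step()
            if trigger(i, observed_events):
                inject_fault()

    # Example: inject a stand-in fault at workload iteration 100.
    run_with_faults(
        workload_steps=[lambda: time.sleep(0.01)] * 200,
        inject_fault=lambda: print("injecting fault"),
        trigger=deterministic_trigger(100),
    )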

**Data Collection & KPI Computation:** Plugins for data collection and
SLA computation will collect various counters from API calls, tenant
space and the underlying reference architecture. Based on the matrix
of counters at various resource points, and the KPI formulas supplied
that operate on this matrix, key performance indicator (KPI) values
are computed. These KPIs are then compared against the reference
service level objectives for the reference architecture and reference
workload combination to provide a pass/fail for the test. Hence, this
plugin is the final arbiter of whether the scenario passes or fails.
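
The following sketch shows one hypothetical shape for this
arbitration: a matrix of counters keyed by resource point, KPI
formulas as plain functions over that matrix, and a comparison against
reference SLOs. None of the names or structures here are a defined
Eris interface.

.. code-block:: python

    # Hypothetical final arbiter: KPI formulas over a counter matrix,
    # compared against reference SLOs. Not a defined Eris interface.
    counters = {                # counter matrix: resource point -> samples
        "nova-api.response_time": [0.2, 0.3, 1.1, 0.4],
        "nova-api.success": [1, 1, 1, 0],
    }

    kpi_formulas = {            # KPI name -> formula over the matrix
        "mean_api_response_time": lambda c:
            sum(c["nova-api.response_time"]) / len(c["nova-api.response_time"]),
        "api_percent_success": lambda c:
            100.0 * sum(c["nova-api.success"]) / len(c["nova-api.success"]),
    }

    reference_slo = {           # KPI name -> (comparator, threshold)
        "mean_api_response_time": ("<=", 1.0),
        "api_percent_success": (">=", 99.0),
    }

    def arbitrate(counters, kpi_formulas, reference_slo):
        kpis = {name: f(counters) for name, f in kpi_formulas.items()}
        ops = {"<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b}
        failures = [name for name, (op, limit) in reference_slo.items()
                    if not ops[op](kpis[name], limit)]
        return kpis, not failures

    kpis, passed = arbitrate(counters, kpi_formulas, reference_slo)
    print(kpis, "PASS" if passed else "FAIL")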

Eris Workflow
-------------

+--------------------------------+
| |image2|                       |
+================================+
| **Figure 3: Eris QA Workflow** |
+--------------------------------+

Apart from the actual Eris framework that is expected to execute the
tests, there is a component of Eris that needs to reside in the QA
framework. Three major components have been identified.

**CI/CD Integration:** Eris test suites need to be integrated into the
CI/CD workflow. Test suite runs need to be tagged, the results
archived and bugs generated. Initially, there may be the capacity for
all Eris tests to be run. However, as the library of test suites and
reference architectures becomes more complex, the gate QA will need to
rely on a smoke test/operational readiness test. Initially, the
identification of what constitutes a reasonable smoke test will have
to be done manually. However, this should evolve toward automatically
identifying a set of smoke tests that can be reasonably handled at the
CI/CD gates.

**Test Frequency:** The tests that Eris proposes to run are
long-running tests. It may not be practical to run them at every code
check-in. The workflow proposal is for the smoke tests to be run once
a day and an operational readiness suite to be run once every
week. Third-party CIs can rely on more exhaustive testing that can run
over multiple days.

**Bug Reporting:** The reporting of bugs for Eris can be tricky. Bugs
are generated when analyzed KPIs from the tests fail to meet the
defined reference SLOs. However, these bugs need to be reproducible.
The question becomes: how many times should a test run before a KPI
miss is considered a bug? This is an open question whose answer will
involve some fairly hard mathematics. It may depend on several states
in the system, and reproducing specific conditions may not be possible
every time. A good approach is to create a bug but attach a frequency
tag to it. Each time a KPI misses its reference objective, the
frequency tag is incremented. The frequency tag can be tied to the
criticality of the bug, and every 10 counts of the frequency tag can
result in the criticality of the bug being bumped up.
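
As a concrete reading of that policy, here is a minimal sketch of the
frequency tag bookkeeping; the starting criticality, the in-memory bug
store and all names are hypothetical, while the bump-every-10 rule
comes from the text above.

.. code-block:: python

    # Minimal sketch of the frequency-tag policy described above.
    # The in-memory "bug store" and all names are hypothetical.
    bugs = {}  # (test, kpi) -> {"frequency": int, "criticality": int}

    def record_kpi_miss(test_name, kpi_name):
        key = (test_name, kpi_name)
        bug = bugs.setdefault(key, {"frequency": 0, "criticality": 1})
        bug["frequency"] += 1
        # Every 10 counts of the frequency tag bumps the criticality up.
        if bug["frequency"] % 10 == 0:
            bug["criticality"] += 1
        return bug

    for _ in range(25):
        record_kpi_miss("boot-storm", "mean_api_response_time")
    print(bugs)  # frequency 25 -> criticality bumped twice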

Eris Design
===========

This is by far the thinnest section in the document, since not all
parts of Eris have been thought through. In a sense this is good,
because it provides a lot of opportunity for the community to fine
tune the project to its needs. A fair amount of thought has been put
into the tools to be used and some of the enhancements that are
needed. The main focus of the design here is a specification and
tools/libraries. The specification can then be broken up into specific
roadmap items for Queens and beyond. Keep in mind that the tools and
libraries will most certainly need changes that extend their current
capabilities.

Design Components
-----------------

+--------------------------------------------------------+
| |image3|                                                |
+========================================================+
| **Figure 4: Eris Implementation Components (Partial)** |
+--------------------------------------------------------+

The general idea is to use Ansible to orchestrate the various test
scenarios. Ansible is Python-based and therefore fits well into the
OpenStack community. It also has a variety of plugins already
available to orchestrate different scenarios, and new plugins can
easily be created for the specific scenarios that OpenStack Eris
needs.

The use of Ansible will result in the following major benefits for the
project:

- Decoupling of the orchestration (Ansible) and the execution (Rally,
  Shaker, etc.).

- Extensive use of existing Ansible plugins for installation and
  distributed orchestration of software.

- A well-documented and open source tool for extending and expanding
  the use of Eris.

- Agentless execution, since agents and tools require extra
  installation but rarely bring benefits for testing.

As can be seen from the proposed design above, Eris does not exclude
the use of already existing tools for performance and failure
injection testing. In fact, the use of Ansible as the orchestration
mechanism provides an incentive for re-using them.
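
As a sketch of that decoupling, the orchestration layer could invoke
playbooks programmatically, for example with the ``ansible-runner``
library; the playbook name, work directory and inventory path below
are placeholder assumptions.

.. code-block:: python

    # Sketch: driving an Ansible playbook from Python via ansible-runner.
    # The playbook, work directory and inventory are placeholders.
    import ansible_runner

    result = ansible_runner.run(
        private_data_dir="/tmp/eris-run",          # artifacts and logs
        playbook="inject_control_plane_load.yml",  # hypothetical playbook
        inventory="/etc/eris/discovered_inventory.json",
    )
    print(result.status, result.rc)                # e.g. "successful", 0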

The other benefit of using Ansible is the ability to include plug-ins
for third-party proprietary tools, with operators and companies
developing their own plugins that conform to the Eris specification.
As an example, an operator may use HP Performance Center as a
performance testing tool, HP SiteScope for gathering metrics and IXIA
for BGP load generation. These could be private plugins for the
operator to generate specific load components and gather metrics while
still using large parts of Eris to discover, inject faults and compute
KPIs.

Deployment
----------

Roadmap item – for the community to specify.

Discovery
---------

The discovery mechanism can use any tool to discover the environment.
It can read from a file, use Fuel or Kubernetes, etc. However, in the
end the discovery mechanism should conform to an Ansible dynamic
inventory that provides a structure describing the site. The
description of the site can be expanded. However, the underlying load
injection mechanisms and metrics gathering mechanisms will depend on
this data. In short, the reference workload, failure injection and
metrics gathering cannot see what discovery does not provide. So, if
the discovery initially provides only server and VM information, those
are the only resources that can be probed.

Ideally, a site is composed of the following components:

- Routers

- Switches

- Servers (Control & Compute)

- Racks

- VMs (or Containers)

- Orchestration services (Kubernetes, Ceph, Calico, etc.)

- OpenStack services and components (Rabbit, MariaDB, etc.)

Eris will need all details related to these components – specifically
ssh keys, IP addresses, MAC addresses and any other variables that
describe how to induce failure and stress. It is not possible to
provide an entire specification here, considering the variety of
installations. However, an example will be provided with the Queens
roadmap.
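
In the meantime, the following is a minimal sketch of what a discovery
plugin could emit as an Ansible dynamic inventory (the standard
``--list`` JSON contract); the group names, hosts and variables are
placeholder assumptions.

.. code-block:: python

    #!/usr/bin/env python3
    """Minimal Ansible dynamic inventory sketch for a discovered zone.

    Group names, hosts and hostvars are placeholder assumptions. Ansible
    calls this script with --list and expects the JSON printed below.
    """
    import json
    import sys

    def discover():
        return {
            "control": {"hosts": ["ctl-01"]},
            "compute": {"hosts": ["cmp-01", "cmp-02"]},
            "_meta": {
                "hostvars": {
                    "ctl-01": {"ansible_host": "192.0.2.10",
                               "services": ["nova-api", "rabbitmq"]},
                    "cmp-01": {"ansible_host": "192.0.2.20"},
                    "cmp-02": {"ansible_host": "192.0.2.21"},
                }
            },
        }

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "--list":
            json.dump(discover(), sys.stdout, indent=2)
        else:
            json.dump({}, sys.stdout)  # --host not needed; _meta covers it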

Load Injection
--------------

Control Plane
~~~~~~~~~~~~~

The tool for control plane load injection is Rally. Rally is very well
known in OpenStack and contains plenty of scenarios to stress the
control plane. Rally does have some gaps in distributed workload
generation and multi-scenario workloads. With respect to Eris, where
the idea is to loosely couple the components that make up a scenario,
tight coupling with Rally is not desirable. Hence, Eris will use Rally
single scenarios, but its own functions and methods for multi-scenario
and distributed workload generation. Initially, Eris’ focus will be on
multi-scenario execution, with distributed load generation closely
following.
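
A minimal sketch of that multi-scenario idea: launch several
single-scenario Rally tasks in parallel through the ``rally task
start`` CLI. The task file names are placeholder assumptions.

.. code-block:: python

    # Sketch: multi-scenario execution by running several single-scenario
    # Rally tasks in parallel. Task file names are placeholder assumptions.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    task_files = ["boot_servers.yaml", "create_networks.yaml",
                  "attach_volumes.yaml"]

    def run_rally_task(task_file):
        return subprocess.run(
            ["rally", "task", "start", task_file],
            capture_output=True, text=True,
        )

    with ThreadPoolExecutor(max_workers=len(task_files)) as pool:
        for result in pool.map(run_rally_task, task_files):
            print(result.args[-1], "rc =", result.returncode)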

Data Plane
~~~~~~~~~~

The tool for data plane load injection is Shaker. Shaker already has a
custom image for iperf3 execution, along with heat templates for
deployment. Eris’ goals for Shaker exceed what Shaker already
provides, and again some significant enhancements to Shaker will need
to be accomplished. Two primary enhancements may be the inclusion of
various other data plane stress mechanisms and the use of an agentless
mechanism over ssh (which Ansible makes extensive use of) to control
the load and gather metrics.
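
The agentless idea could look like the sketch below: start a load tool
on a discovered tenant VM over ssh and read its output back, with no
agent pre-installed. The hosts, key path and the choice of ``iperf3``
as the load tool are placeholder assumptions, and ``iperf3`` must
already be present in the VM image.

.. code-block:: python

    # Sketch: agentless data plane load over ssh, no pre-installed agent.
    # Hosts, key path and the iperf3 invocation are placeholder
    # assumptions; iperf3 must already exist on the target VM image.
    import subprocess

    def run_remote(host, command, key="/etc/eris/ssh_key"):
        return subprocess.run(
            ["ssh", "-i", key, "-o", "StrictHostKeyChecking=no",
             f"tenant@{host}", command],
            capture_output=True, text=True, timeout=120,
        )

    # Server side on one VM, client side on another, JSON output back.
    run_remote("192.0.2.30", "iperf3 --server --daemon")
    result = run_remote("192.0.2.31",
                        "iperf3 --client 192.0.2.30 --time 60 --json")
    print(result.stdout)  # raw iperf3 JSON for the metrics plugin to parse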

Fault Injection
---------------

TODO

Metrics Gathering
-----------------

TODO

SLA Computation
---------------

TODO

Eris Roadmap
============

TODO

Eris in Popular Literature
==========================

TODO

.. |image0| image:: ./media/image1.jpg
   :width: 6.04097in
   :height: 3.13736in
.. |image1| image:: ./media/image2.jpg
   :width: 6.36813in
   :height: 2.06361in
.. |image2| image:: ./media/image3.png
   :width: 6.5in
   :height: 3.65625in
.. |image3| image:: ./media/image4.jpg
   :width: 6.35165in
   :height: 2.10833in
BIN doc/source/eris/media/image1.jpg (new file, 42 KiB, binary file not shown)
BIN doc/source/eris/media/image2.jpg (new file, 52 KiB, binary file not shown)
BIN doc/source/eris/media/image3.png (new file, 174 KiB, binary file not shown)
BIN doc/source/eris/media/image4.jpg (new file, 48 KiB, binary file not shown)

@@ -16,6 +16,7 @@ Contributions to this documentation are warmly encouraged; please see

   use-cases
   specs
   eris/index


Indices and tables