Adjust the Vitrage & Mistral use case to the new template format
Change-Id: Iae1bb26e3c6061f63ef4fac58e354c56cb32e91b
parent df25dec156
commit 6161217a88
@@ -8,4 +8,4 @@ a starting point.
    :glob:
    :maxdepth: 1

-   use-cases/vitrage-mistral-integration.rst
+   use-cases/nic-failure-affects-instance-and-app.rst
101
use-cases/nic-failure-affects-instance-and-app.rst
Normal file
@@ -0,0 +1,101 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==============================================
NIC failure affects instances and applications
==============================================

As a cloud operator, whenever one of my cloud's compute nodes has a NIC
failure, I want to be notified of all affected resources including instances
and applications. Moreover, I want the failed instances to be migrated away to
other hardware so my applications will continue to function.


Problem description
===================

A NIC failure may cause the host, as well as all instances running on it, to
become unreachable. This may also affect applications that use these
instances, causing them to lose their high availability.


Fault class
===========

Network failure


OpenStack projects used
=======================

* Zabbix (or any other third-party monitor)
* Vitrage
* Mistral


Remediation class
=================

Reactive


Fault detection
===============

No OpenStack component detects a NIC failure, so detection has to be done by
a third-party monitor such as Zabbix.


Inputs and decision-making
==========================

Once the NIC failure is detected, the cloud operator should understand which
resources and applications are affected.


Remediation
===========

Instances that became unreachable due to the network failure should be
migrated to another host, so that the applications continue to function.


Existing implementation(s)
==========================

To identify the failed resources, the cloud operator can use Vitrage. Vitrage
will be notified by the external monitor (such as Zabbix) about the failed NIC.
Based on its cloud topology awareness, Vitrage will raise additional alarms on
the host, instances and affected applications.

An affected application will most likely be running in HA mode, so it will
perform a fail-over to the standby instance. However, it will lose its
high availability.

The cloud operator can see this information in the Vitrage Entity Graph, locate
a failed instance that affects an application, and ask to execute a
VM-migration Mistral workflow on that instance.

Alternatively, Vitrage can **automatically** execute a Mistral workflow that
will migrate the failed instance to a different host, so the application will
get back to a fully-operational state.

.. figure:: ./vitrage_and_mistral.png
   :scale: 100 %
   :align: center
   :alt: alternate text


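The topology-based alarm propagation described above can be illustrated with a
small sketch. This is not Vitrage code: the ``TOPOLOGY`` mapping, the resource
names and the ``affected_resources`` helper are all hypothetical, and the sketch
only assumes that each resource knows which resources depend on it.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources that depend on it.
TOPOLOGY = {
    "nic-1": ["host-1"],
    "host-1": ["instance-a", "instance-b"],
    "instance-a": ["app-x"],
    "instance-b": [],
    "app-x": [],
}


def affected_resources(topology, failed):
    """Return every resource that depends on the failed one, breadth-first."""
    seen = {failed}
    order = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in topology.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    # A NIC alarm arrives from the external monitor; list the deduced alarms.
    for resource in affected_resources(TOPOLOGY, "nic-1"):
        print(f"deduced alarm on {resource}")
```

A breadth-first walk is used here only because it lists the host before its
instances and the instances before their applications, which matches the order
in which the text describes the alarms.
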
Future work
===========

None (supported from OpenStack Queens onward)


Dependencies
============

None
@@ -1,119 +0,0 @@
===============================
Vitrage and Mistral Integration
===============================

Overview
========

Self-healing and fast recovery in real-world cloud systems are challenging:

* Failures happen in real distributed systems
* A single failure may affect many resources
* We can see symptoms but it’s hard to find the root cause
* Recovery might be complicated

The integration of Vitrage and Mistral can help identify the root cause and
take corrective actions, in an end-to-end self-healing scenario.

Vitrage is the OpenStack Root Cause Analysis service for organizing, analyzing
and visualizing OpenStack and external alarms. It is used to provide insights
about the root cause of problems and deduce their existence before they are
directly reported.

Mistral is the OpenStack workflow service. It aims to provide a mechanism to
define tasks and workflows without writing code, and to manage and execute
them in the cloud environment.


Use Cases
=========

The integration of Vitrage with Mistral supports two kinds of use cases:

* Automatic workflow execution, based on predefined conditions
* Manual workflow execution from Vitrage Entity Graph (WIP in Rocky)


Use Case 1: NIC failure causes automatic instance migration
-----------------------------------------------------------

*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
failure, I want to be notified of all affected resources including instances
and applications. Moreover, I want the failed instances to be migrated away to
other hardware."*

In a complex system, a failure in one resource can have a wide effect on other
resources. One example is a NIC failure, which may cause the host, as well as
all instances running on it, to become unreachable. This may also affect
applications that use these instances, causing them to lose their high
availability.

To identify the failed resources, the cloud operator can use Vitrage. Vitrage
will be notified by an external monitor (such as Zabbix) about the failed NIC.
Based on its cloud topology awareness, Vitrage will raise additional alarms on
the host, instances and affected applications.

An affected application will most likely be running in HA mode, so it will
perform a fail-over to the standby instance. However, it will lose its
high availability. To fix this, Vitrage can execute a Mistral workflow
that will migrate the failed instance to a different host, so the application
will get back to a fully-operational state.

.. figure:: ./vitrage_and_mistral.png
   :scale: 100 %
   :align: center
   :alt: alternate text

Use Case 2: NIC failure with an optional manual instance migration
------------------------------------------------------------------

*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
failure, I want to be notified of all affected resources including instances
and applications. I then want an easy way to manually migrate a failed
instance to another compute node and track its state."*

This is currently a work in progress (WIP) in Rocky.

This use case is similar to use case 1, except that the cloud operator has
not pre-configured Vitrage to execute a Mistral workflow when an
application is affected by an instance being unreachable.

As a result of a NIC failure, Vitrage raises alarms on the host, its instances
and the applications that are using them. The cloud operator can see this
information in the Vitrage Entity Graph, locate a failed instance that affects
an application, and ask to execute a VM-migration Mistral workflow on that
instance.


Technical Details
=================

Vitrage ``evaluator templates`` define the business logic and the way that
Vitrage handles alarms and resource states. A template contains ``scenarios``,
where each scenario is made of a ``condition`` and ``actions``.

Among other actions (such as raising an alarm or modifying the state of a
resource), the cloud operator can ask to execute a Mistral workflow with
certain parameters. For example, the cloud operator can define this scenario:

* ``condition:`` an application contains an instance that is unreachable
* ``action:`` execute a Mistral VM-Migration workflow on that instance

More details about Vitrage template definitions can be found here_.

.. _here: https://docs.openstack.org/vitrage/latest/contributor/vitrage-template-format.html


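The condition/action pairing above can be sketched in a few lines of Python.
This is only an illustration of the idea and is not the Vitrage template
syntax, which is documented at the link above. The ``Scenario`` class, the
entity dictionaries, the ``vm_migration`` workflow name and every function
name below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """A condition over an entity plus the actions to run when it holds."""
    condition: Callable[[Dict], bool]
    actions: List[Callable[[Dict], str]]


def instance_unreachable_in_app(entity):
    # Condition: an instance that belongs to an application and is unreachable.
    return (entity.get("type") == "instance"
            and entity.get("application") is not None
            and entity.get("state") == "unreachable")


def request_migration_workflow(entity):
    # Action: only describes the request; a real deployment would ask Mistral
    # to run the workflow instead of returning a string.
    return f"execute workflow 'vm_migration' for {entity['id']}"


def evaluate(scenarios, entities):
    """Run every scenario against every entity and collect the action results."""
    results = []
    for scenario in scenarios:
        for entity in entities:
            if scenario.condition(entity):
                results.extend(action(entity) for action in scenario.actions)
    return results


if __name__ == "__main__":
    scenario = Scenario(instance_unreachable_in_app, [request_migration_workflow])
    entities = [
        {"id": "instance-a", "type": "instance",
         "application": "app-x", "state": "unreachable"},
        {"id": "instance-b", "type": "instance",
         "application": None, "state": "unreachable"},
    ]
    for line in evaluate([scenario], entities):
        print(line)
```

Only ``instance-a`` triggers the action in this usage example, because
``instance-b`` is not part of any application.
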
Note that Vitrage could call Nova evacuate directly for the failed instance,
but using a Mistral workflow is more robust: Mistral can track the Nova
evacuation process, check its status and verify that everything worked
as expected.


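The kind of tracking described above amounts to polling until the operation
reaches a terminal state. The following is a minimal sketch of that pattern in
plain Python, not a Mistral workflow definition. The ``get_status`` callable,
the ``"ACTIVE"`` and ``"ERROR"`` terminal values and the ``"TIMEOUT"`` result
are assumptions made for illustration.

```python
import time


def wait_for_evacuation(get_status, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state or the timeout expires.

    get_status is a hypothetical callable returning the current status string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("ACTIVE", "ERROR"):
            # Terminal state reached: the caller can verify success or failure.
            return status
        if time.monotonic() >= deadline:
            return "TIMEOUT"
        sleep(interval)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status query.
    statuses = iter(["MIGRATING", "MIGRATING", "ACTIVE"])
    print(wait_for_evacuation(lambda: next(statuses), timeout=5.0, interval=0.0))
```

Calling the evacuation directly would leave this loop, and the handling of the
error and timeout outcomes, to the caller, which is the robustness argument the
text makes for delegating to a workflow.
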
References
==========

- https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral

- https://wiki.openstack.org/wiki/Vitrage

- https://docs.openstack.org/mistral/latest/