Adjust the Vitrage & Mistral use case to the new template format

Change-Id: Iae1bb26e3c6061f63ef4fac58e354c56cb32e91b
2018-10-17 12:52:07 +00:00 · 2018-10-17 12:52:07 +00:00 · 6161217a88
commit 6161217a88
parent df25dec156
3 changed files with 102 additions and 120 deletions
--- a/doc/source/use-cases.rst
+++ b/doc/source/use-cases.rst
@ -8,4 +8,4 @@ a starting point.
   :glob:
   :maxdepth: 1

-   use-cases/vitrage-mistral-integration.rst
+   use-cases/nic-failure-affects-instance-and-app.rst
--- a/use-cases/nic-failure-affects-instance-and-app.rst
+++ b/use-cases/nic-failure-affects-instance-and-app.rst
@ -0,0 +1,101 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==============================================
+NIC failure affects instances and applications
+==============================================
+
+As a cloud operator, whenever one of my cloud's compute nodes has a NIC
+failure, I want to be notified of all affected resources including instances
+and applications. Moreover, I want the failed instances to be migrated away to
+another hardware so my applications will continue to function.
+
+
+Problem description
+===================
+
+A NIC failure may cause the host, as well as all instances running on it, to
+become unreachable. This may also affect applications that are using these
+instances and lose their high-availability.
+
+
+Fault class
+===========
+
+Network failure
+
+
+OpenStack projects used
+=======================
+
+* Zabbix (or any other 3rd party monitor)
+* Vitrage
+* Mistral
+
+
+Remediation class
+=================
+
+Reactive
+
+
+Fault detection
+===============
+
+There is no OpenStack component that detects a NIC failure, so it has to be
+done using a 3rd party monitor like Zabbix.
+
+
+Inputs and decision-making
+==========================
+
+Based on the NIC failure detection, the cloud operator should understand which
+resources and applications are affected.
+
+
+Remediation
+===========
+
+Instances that became unreachable due the the network failure should be
+migrated to another host, so the applications should continue to function.
+
+
+Existing implementation(s)
+==========================
+
+To identify the failed resources, the cloud operator can use Vitrage. Vitrage
+will be notified by the external monitor (such as Zabbix) about the failed NIC.
+Based on its cloud topology awareness, Vitrage will raise additional alarms on
+the host, instances and affected applications.
+
+An affected application will most likely be running in HA mode, so it will
+perform a fail-over to the standby instance. However, it will lose its
+high-availability.
+
+The cloud operator can see this information in Vitrage Entity Graph, locate
+a failed instance that affects an application, and ask to execute a
+VM-migration Mistral workflow on that instance.
+
+Alternatively, Vitrage can **automatically** execute a Mistral workflow that
+will migrate the failed instance to a different host, so the application will
+get back to a fully-operational state.
+
+.. figure:: ./vitrage_and_mistral.png
+   :scale: 100 %
+   :align: center
+   :alt: alternate text
+
+
+Future work
+===========
+
+None (supported from OpenStack Queens and on)
+
+
+Dependencies
+============
+
+None
--- a/use-cases/vitrage-mistral-integration.rst
+++ b/use-cases/vitrage-mistral-integration.rst
@ -1,119 +0,0 @@
-===============================
-Vitrage and Mistral Integration
-===============================
-
-Overview
-========
-
-Self-healing and fast recovery in real world cloud systems is challenging...
-
-* Failures happen in real distributed systems
-* A single failure may affect many resources
-* We can see symptoms but it’s hard to find the root cause
-* Recovery might be complicated
-
-The integration of Vitrage and Mistral can help identifying the root cause and
-taking corrective actions, in an end-to-end self-healing scenario.
-
-Vitrage is the OpenStack Root Cause Analysis service for organizing, analyzing
-and visualizing OpenStack and external alarms. It is used to provide insights
-about the root cause of problems and deduce their existence before they are
-directly reported.
-
-Mistral is the OpenStack workflow service. It aims to provide a mechanism to
-define tasks and workflows without writing code, manage and execute them in the
-cloud environment.
-
-
-Use Cases
-=========
-
-The integration of Vitrage with Mistral supports two kinds of use cases:
-
-* Automatic workflow execution, based on predefined conditions
-* Manual workflow execution from Vitrage Entity Graph (WIP in Rocky)
-
-
-Use Case 1: NIC failure causes automatic instance migration
-----------------------------------------------------------
-
-*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
-failure, I want to be notified of all affected resources including instances
-and applications. Moreover, I want the failed instances to be migrated away to
-another hardware."*
-
-In a complex system, a failure in one resource can have a wide effect on other
-resources. One example is a NIC failure, that may cause the host, as well as
-all instances running on it, to become unreachable. This may also affect
-applications that are using these instances and lose their high-availability.
-
-To identify the failed resources, the cloud operator can use Vitrage. Vitrage
-will be notified by an external monitor (such as Zabbix) about the failed NIC.
-Based on its cloud topology awareness, Vitrage will raise additional alarms on
-the host, instances and affected applications.
-
-An affected application will most likely be running in HA mode, so it will
-perform a fail-over to the standby instance. However, it will lose its
-high-availability. In order to fix it, Vitrage can execute a Mistral workflow
-that will migrate the failed instance to a different host, so the application
-will get back to a fully-operational state.
-
-.. figure:: ./vitrage_and_mistral.png
-   :scale: 100 %
-   :align: center
-   :alt: alternate text
-
-Use Case 2: NIC failure with an optional manual instance migration
------------------------------------------------------------------
-
-*"As a cloud operator, whenever one of my cloud's compute nodes has a NIC
-failure, I want to be notified of all affected resources including instances
-and applications. I then want an easy way to manually migrate a failed
-instance to another compute and track its state."*
-
-This is currently WIP in Rocky.
-
-The use case is similar to use case 1, but in this use case the cloud operator
-did not pre-configured Vitrage to execute a Mistral workflow when an
-application is affected by an instance being unreachable.
-
-As a result of a NIC failure, Vitrage raises alarms on the host, its instances
-and the applications that are using them. The cloud operator can see this
-information in Vitrage Entity Graph, locate a failed instance that affects an
-application, and ask to execute a VM-migration Mistral workflow on that
-instance.
-
-
-Technical Details
-=================
-
-Vitrage ``evaluator templates`` define the business logic and the way that
-Vitrage handles alarms and resource states. A template contains ``scenarios``,
-where each scenario is made of ``condition`` and ``actions``.
-
-Among other actions (like raise an alarm or modify the state of a resource),
-the cloud operator can ask to execute a Mistral workflow with certain
-parameters. For example, the cloud operator can define this scenario:
-
-* ``condition:`` an application contains an instance that is unreachable
-* ``action:`` execute a Mistral VM-Migration workflow on that instance
-
-More details about Vitrage template definitions can be found here_
-
-.. _here: https://docs.openstack.org/vitrage/latest/contributor/vitrage-template-format.html
-
-
-Note that Vitrage could call Nova evacuate directly for the failed instance,
-but using a Mistral workflow is a much more robust option. Mistral can track
-the Nova evacuation process, check its status and verify that everything worked
-as expected.
-
-
-References
-==========
-
- https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral
-
- https://wiki.openstack.org/wiki/Vitrage
-
- https://docs.openstack.org/mistral/latest/