self-healing-sig/use-cases/heat-mistral-aodh.rst
Adam Spiers 3bca09913f Remove CC license headers from individual specs / use cases
These are unnecessary because they are already covered by the
top-level LICENSE file.

Change-Id: I98145df081c03ce5cfa0ecfbf5133f9d658cc126
2019-04-25 23:54:21 +01:00

123 lines
4.3 KiB
ReStructuredText

=============================================================
Self-healing and auto-scaling with Heat, Mistral, Zaqar, Aodh
=============================================================
This use case describes an existing scenario for auto-scaling a
self-healing cluster.
OpenStack projects used
=======================
* Heat
* Mistral
* Zaqar
* Telemetry (Aodh, Pinko, Gnocchi)
Remediation class
=================
Reactive
Fault detection
===============
The self-healing scenario detect failure like when any instance in
cluster been delete, or updated out side of this automatic flow.
Inputs and decision-making
==========================
* A cluster builder (Heat):
You can use Heat to build cluster, managing resource dependancies,
plan some flexibility with batch action, and also reproducable to
another OpenStack environment with same template. Whatever you're
using for building cluster, you need to make sure it allow to your
flow control service (Mistral) to trigger update and replace
unhealthy node in cluster.
* A monitoring service (Aodh):
There are multiple choices for self-healing monitors to monitor that
cluster and notify queue service; you can absolutely replace any
monitoring service in this showcase and use your own existing
services. Just make sure you're able to monitoring nodes in cluster
and able to send notification out when alarm triggered.
* A queue service (Zaqar):
A queue or any services that accept notification from alarm service
and capable to trigger healing flow.
* A triggerable flow control service (Mistral):
A triggerable flow control service to run healing flow. Make sure
your flow contains most cases. So we can guarantee that cluster is
healed. Covering most successes and failure cases (as many as you
can) is always a good idea.
We're using Heat, Aodh, Zaqar, and Mistral in this example. You can
always replace any service with your own services. Just make sure
they're fully integrated to each other for your healing scenario.
Remediation
===========
In our example, for each unit in the cluster, we set up a Nova
instance to run application on, binding that instance to a
load-balancer so the application can use lbaas in its infrastructure.
For healing, we create multiple event alarms monitoring the status of
that instance, a Zaqar queue for alarms to send notifications on, and
a Zaqar MistralTrigger (a resource type in Heat to use Zaqar
subscription to trigger Mistral workflow) to listen on that queue and
trigger Mistral workflow to mark this unit as unhealthy and then use
Heat stack update to replace unhealthy unit in cluster.
At this point you already got your structure for self-healing. As you
can see in the example, we are monitoring events like
``compute.instance.delete.*``. If you also use Aodh, please make sure
the event you're monitoring is enabled from services and ceilometer
(be aware that some events are disabled by default).
The event alarm going to send a ``POST`` request to Zaqar queue if the
instance status changed to delete. Again, it's possible to use your
own alarm services.
Once Zaqar queue gets that information, it will trigger a Mistral
workflow. That workflow will mark that unit (in this case it's a
server with nginx) as unhealthy in Heat and trigger a stack
update. Heat will replace all unhealthy unit and regenerate one for
your cluster.
Existing implementation(s)
==========================
With ability to self-heal, you can (as we show in this case) combine
with autoscaling ability. Also as you can notice, we use Heat
container agent, so your image only need to support ability to run
container (instead of pre-install services with
diskimage-builder). Here is the example Heat template :doc:`sample
files <heat-mistral-aodh>`. Also here's link for relevant `slides
(with video link)
<https://www.slideshare.net/GuanYuLin1/autoscale-a-selfhealing-cluster-in-openstack-with-heat>`_.
Future work
===========
We need better support for scaling with some resources require trust.
All further work is tracked in `the story "Improve Autoscaling
integration with Self-healing"
<https://storyboard.openstack.org/#!/story/2003690>`_. Also
integration with Vitrage is work in progress tracked in `the story
"Achieve Self healing with Vitrage Mistral And Heat"
<https://storyboard.openstack.org/#!/story/2002684>`_.