Spec for host recovery

Change-Id: Ifa2583901cd2dff0b450d81fd7de96b27e9c315a
2016-10-14 13:57:09 +02:00 · 2016-10-14 13:57:09 +02:00 · e243a2c545
commit e243a2c545
parent 468d526263
1 changed files with 161 additions and 0 deletions
--- a/specs/newton/approved/newton-instance-ha-host-recovery.rst
+++ b/specs/newton/approved/newton-instance-ha-host-recovery.rst
@@ -0,0 +1,161 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+=============
+Host Recovery
+=============
+
+This spec describes a method for recovering all virtual machines that were
+running on a host when that host failed.
+
+Problem description
+===================
+
+When a whole compute node fails, recovering its instances is crucial for
+providing high availability for virtual machines. On the other hand,
+automatically recovering some instances may cause more problems than the
+fact that they were suddenly turned off.
+
+Taking both points into account, automatic recovery needs to be configurable
+at both the instance level and the host level. This spec describes the
+possible actions in case of compute node failure and how they are
+configured. Automatic recovery of particular instances is out of scope for
+this spec and will be described in another document.
+
+Use Cases
+---------
+
+* As a cloud operator, I would like to provide my users with highly
+  available VMs to meet high SLA requirements. Therefore, I need some of my
+  VMs to be recovered automatically after a compute node failure.
+
+Proposed change
+===============
+
+VM recovery can be performed on the control plane of the OpenStack cloud. It
+would be done using the mistral workflow service and a pacemaker resource
+agent. The resource agent would be responsible for starting the workflow,
+whereas mistral would be responsible for performing *nova_evacuate* for each
+VM and for observing the state of each evacuated VM. Using mistral would
+ensure that the evacuation workflow completes even if one of the controllers
+dies during the process.
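+
+The division of work described above (evacuate each VM, then observe it
+until it settles) can be sketched in Python. This is only an illustration
+under stated assumptions, not the proposed mistral workflow: the function
+``recover_host`` and the callables ``list_instances``, ``evacuate`` and
+``get_state`` are hypothetical stand-ins for the corresponding nova calls,
+and an instance is assumed to be recovered once it reports ``ACTIVE`` on a
+host other than the failed one.
+

```python
import time


def recover_host(list_instances, evacuate, get_state, failed_host,
                 on_shared_storage, timeout=300.0, poll_interval=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Evacuate every instance on ``failed_host`` and wait for the outcome.

    All callables are hypothetical stand-ins (assumptions, not nova APIs):
      list_instances(host)        -> iterable of instance ids
      evacuate(instance_id, flag) -> requests evacuation, may raise
      get_state(instance_id)      -> (status, current_host)

    Returns a dict mapping instance id to "recovered", "failed" or "timeout".
    """
    results = {}
    pending = []
    for instance_id in list(list_instances(failed_host)):
        try:
            evacuate(instance_id, on_shared_storage)
            pending.append(instance_id)
        except Exception:
            # One rejected request must not stop the remaining evacuations.
            results[instance_id] = "failed"

    deadline = clock() + timeout
    while pending:
        still_pending = []
        for instance_id in pending:
            status, host = get_state(instance_id)
            if status == "ACTIVE" and host != failed_host:
                results[instance_id] = "recovered"
            elif status == "ERROR":
                results[instance_id] = "failed"
            else:
                still_pending.append(instance_id)
        pending = still_pending
        if not pending:
            break
        if clock() >= deadline:
            # Give up on whatever has not settled within the timeout.
            for instance_id in pending:
                results[instance_id] = "timeout"
            break
        sleep(poll_interval)
    return results
```

+
+Passing the nova calls in as parameters keeps the sketch independent of any
+particular client library. Unlike mistral, a plain loop like this does not
+survive the death of the controller running it, which is the reason the spec
+proposes a workflow service.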
+
+Alternatives
+------------
+
+1. We could skip the mistral workflow entirely and handle everything related
+   to *nova_evacuate* in the pacemaker resource agents. However, we would
+   then have to implement all of the HA mechanisms in the agents, which
+   would be difficult.
+
+2. We could try to implement a real *host-evacuate* in nova. Right now,
+   *host-evacuate* iterates over all instances on the given host on the
+   client side. We could try to move this logic into nova, but nova cores
+   were against this change in the past.
+
+Data model impact
+-----------------
+
+None
+
+API impact
+----------
+
+None
+
+Security impact
+---------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Extra RAM and CPU would be needed on each controller node to run both the
+pacemaker and mistral services. If they are already present on the control
+plane, there would be no performance impact.
+
+Other deployer impact
+---------------------
+
+Distributions need to package and deploy extra services on each controller
+node: the mistral service and the pacemaker resource agent.
+
+Developer impact
+----------------
+
+None beyond the work items listed below.
+
+Implementation
+==============
+
+The resource agent would receive a notification from the host monitor that a
+given host is down. It would then send a request to mistral to start the
+recovery workflow. The request needs the input parameters below, where
+``COMPUTE_NAME`` stands for the name of the failed compute host and
+``on_shared_storage`` is either ``true`` or ``false``:
+
+.. code-block:: json
+
+    {
+        "search_opts": {
+            "host": "COMPUTE_NAME"
+        },
+        "on_shared_storage": true
+    }
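+
+As a rough illustration, a resource agent written in Python could assemble
+this input as shown below. The helper name ``build_recovery_input`` and the
+host name ``compute-1`` are hypothetical; only the two keys come from this
+spec, and how the payload is actually submitted to mistral is left open.
+

```python
import json


def build_recovery_input(compute_name, on_shared_storage):
    """Build the workflow input for one failed compute host.

    Hypothetical helper: only the key names are taken from the spec.
    """
    if not compute_name:
        raise ValueError("compute_name must be a non-empty host name")
    return {
        "search_opts": {"host": compute_name},
        # Coerce to a real boolean so it serializes as JSON true/false.
        "on_shared_storage": bool(on_shared_storage),
    }


# Serialize the input for a host that uses shared instance storage.
payload = json.dumps(build_recovery_input("compute-1", True), sort_keys=True)
print(payload)
```
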
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  <launchpad-id or None>
+
+Other contributors:
+  <launchpad-id or None>
+
+Work Items
+----------
+
+* Prepare a resource agent that triggers mistral
+* Prepare the mistral workflow
+* Document the changes in the HA guide
+
+Dependencies
+============
+
+Host monitor
+
+Testing
+=======
+
+Documentation Impact
+====================
+
+The service should be documented in the ha-guide.
+
+References
+==========
+
+- `Instance HA etherpad started at Newton Design Summit in Austin
+  <https://etherpad.openstack.org/p/newton-instance-ha>`_
+
+- `"High Availability for Virtual Machines" user story
+  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_
+
+- `Video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
+  <https://youtu.be/lddtWUP_IKQ>`_
+
+- `automatic-evacuation etherpad
+  <https://etherpad.openstack.org/p/automatic-evacuation>`_
+
+- `Instance auto-evacuation cross project spec (WIP)
+  <https://review.openstack.org/#/c/257809>`_
+
+
+History
+=======