VM Recovery

The purpose of this spec is to describe a method for recover the VMs from VM failures. Change-Id: I3648aacc2cfefe2bb5981f694415ceab17b2dfb8
2016-10-17 17:42:27 +09:00 · 2016-10-17 17:42:27 +09:00 · 8a4a70db74
commit 8a4a70db74
parent e243a2c545
1 changed files with 200 additions and 0 deletions
--- a/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
+++ b/specs/newton/approved/newton-instance-ha-vm-recovery-spec.rst
@ -0,0 +1,200 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 ==========================================
 VM Recovery
 ==========================================
 The purpose of this spec is to describe a method to recover
 individual virtual machines that are marked as failed by
 the VM monitoring component.
 Problem description
 ===================
 VM failure can be detected by VM monitoring method discussed in
 `vm monitoring spec`__.
 __ https://review.openstack.org/#/c/352217/
 When VM failure event is detected, appropriate recovery actions must
 be taken. Those recovery actions should be decided using configurable
 policies based on inputs such as the state of storage (shared or
 otherwise), status of the VM, and cause of the VM failure.
 Use Cases
 ---------
 As a cloud operator, I would like to provide my users with highly
 available VMs to meet high SLA requirements. There are several types
 of VM failure events that can occur in OpenStack clouds.
 We need to make sure such events can be detected and recovered
 by the system. Possible VM failure events include:
 - VM crashes
 - VM hangs
 Possible recovery methods include:
 - VM restart (stop and start)
 - VM restart on different host
 Scope
 -----
 This spec only addresses recovery from isolated failures of individual
 VMs. Monitoring of the VMs, and detection and recovery from wider
 failures, such as failure of a whole compute host, will be covered by
 separate specs, and are therefore out of scope for this spec.
 This spec has the following goals:
 1. Encourage all implementations of VM recovery, whether upstream or
   downstream, to receive failure notifications in a standardized
   manner. This will allow cloud vendors and operators to implement
   HA of the compute plane via a collection of compatible components
   (of which one is compute node monitoring), whilst not being tied to
   any one implementation.
 2. Suggest appropriate actions which can be taken for each failure
   case.
 3. Provide details of and recommend a specific implementation which
   for the most part already exists and is proven to work.
 4. Identify gaps with that implementation and corresponding future
   work required.
 Proposed change
 ===============
 VM monitors send failure events to a recovery workflow service. This
 workflow service can analyze the content of the failure event message
 and execute the appropriate recovery action. This workflow service
 could also handle the advanced recovery options such as maximum
 restart threshold, execute next recovery action or execute multiple
 workflows.
 If a VM crashes, the first approach to recovery is stop and start the
 VM from nova-api. The maximum restart threshold should be
 configurable, and it could be 0, which means do not restart and go to
 next recovery method. If restart fails, or threshold is 0, it should
 try to restart the VM on a different host. The threshold could even be
 -1, to indicate an infinite number of retries on this host, preventing
 the VM from ever being restarted on a different host. This might be
 desirable in certain configurations where there is no shared storage
 for ephemeral disks, and rebuild of a disk from a glance image during
 ``nova evacuate`` is undesirable.
 If a VM hangs due to an I/O error, the recovery service may be
 required to automatically disable the ``nova-compute`` service on that
 host and restart the VM on a different host. It could also migrate
 other VMs from the host, in order to preempt further I/O errors.
 Implementation
 ==============
 There are at least three possible ways to implement the proposed
 change:
 1. Use Masakari as recovery workflow service
   VM monitors send the failure events to Masakari using Masakari's
   notification API. Masakari will execute pre-defined recovery actions.
 2. Use Mistral as recovery workflow service
   VM monitors call the Mistral workflow to execute execute appropriate
   recovery actions.
 3. Use Masakari as recovery engine and Mistral as workflow service
   VM monitors send the failure events to Masakari and Masakari will
   analyze the content of the failure event message and call Mistral
   workflow to execute recovery actions.
 Data model impact
 -----------------
 None
 REST API impact
 ---------------
 The HTTP API of the VM recovery workflow service needs to be able to
 receive events in the format they are sent by the VM monitor.
 Security impact
 ---------------
 Ideally it should be possible for the VM monitor to send instance
 event data securely to the recovery workflow service (e.g. via TLS),
 without relying on the security of the admin network over which the
 data is sent.
 Other end user impact
 ---------------------
 None
 Performance Impact
 ------------------
 None
 Other deployer impact
 ---------------------
 Developer impact
 ----------------
 Documentation Impact
 --------------------
 The service should be documented in the |ha-guide|_.
 .. |ha-guide| replace:: OpenStack High Availability Guide
 .. _ha-guide: http://docs.openstack.org/ha-guide/
 Assignee(s)
 -----------
 Primary assignee:
  <launchpad-id or None>
 Other contributors:
  <launchpad-id or None>
 Work Items
 ==========
 WIP
 Dependencies
 ============
 Testing
 =======
 Documentation Impact
 ====================
 References
 ==========
 History
 =======