Spec for host recovery
Change-Id: Ifa2583901cd2dff0b450d81fd7de96b27e9c315a
Parent: 468d526263
Commit: e243a2c545

specs/newton/approved/newton-instance-ha-host-recovery.rst (new file, 161 lines)

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============
Host Recovery
=============

This spec describes a method to recover all virtual machines that were
running on a host when that host failed.

Problem description
===================

When a whole compute node fails, recovering its instances is crucial for
providing high availability for virtual machines. On the other hand,
automatically recovering some instances may cause more problems than the
fact that they were suddenly turned off.

Taking both arguments into account, automatic recovery needs to be
configurable at both the instance level and the host level. This spec
describes the possible actions after a compute node failure and how they
are configured. Automatic recovery of individual instances is out of scope
and will be described in another document.

Use Cases
---------

* As a cloud operator, I would like to provide my users with highly
  available VMs to meet high SLA requirements. Therefore, I need some of my
  VMs to be recovered automatically after a compute node failure.

Proposed change
===============

VM recovery can be performed on the control plane of an OpenStack cloud. It
would be done using the mistral workflow service and a pacemaker resource
agent. The resource agent would be responsible for starting the workflow,
whereas mistral would be responsible for performing *nova_evacuate* for each
VM and for observing the state of each evacuated VM. Using mistral would
ensure that the evacuation workflow runs to completion even if some of the
controllers die during the process.
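
The evacuate-then-observe logic described above can be illustrated with a
minimal Python sketch. This is not the mistral workflow itself, which would be
written as a workflow definition; it only shows the control flow under stated
assumptions. The function name ``recover_host`` and the injected ``evacuate``
and ``get_status`` callables are hypothetical stand-ins for the real nova
calls, and treating ``ACTIVE`` or ``ERROR`` as the final states is a
simplification (a real workflow would likely also confirm that the instance
landed on a different host).

```python
import time
from typing import Callable, Dict, Iterable


def recover_host(
    instance_ids: Iterable[str],
    evacuate: Callable[[str], None],
    get_status: Callable[[str], str],
    timeout: float = 300.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Evacuate every instance, then poll until each one settles.

    ``evacuate`` and ``get_status`` are hypothetical callables supplied by
    the caller. Returns a mapping of instance id to last observed status.
    """
    pending = list(instance_ids)
    results: Dict[str, str] = {}

    # Step 1: request evacuation of each VM from the failed host.
    for instance_id in pending:
        evacuate(instance_id)

    # Step 2: observe each VM until it is ACTIVE or ERROR, or we time out.
    waited = 0.0
    while pending:
        for instance_id in list(pending):
            status = get_status(instance_id)
            results[instance_id] = status
            if status in ("ACTIVE", "ERROR"):
                pending.remove(instance_id)
        if not pending or waited >= timeout:
            break
        sleep(poll_interval)
        waited += poll_interval
    return results
```

Passing the nova calls in as parameters keeps the sketch independent of any
particular client library and makes the loop easy to exercise with fakes.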

Alternatives
------------

1. We could skip the mistral workflow entirely and do all *nova_evacuate*
   related work in the pacemaker resource agents. However, we would then have
   to implement the whole HA mechanism in the agents, which would be
   difficult.

2. We could try to implement a real *host-evacuate* in nova. Right now
   *host-evacuate* iterates over all instances on the given host on the
   client side. We could move this logic into nova, but nova cores were
   against this change in the past.

Data model impact
-----------------

None

API impact
----------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Each controller node would need extra RAM and CPU to run both the pacemaker
and mistral services. If they are already present on the control plane,
there would be no additional performance impact.

Other deployer impact
---------------------

Distributions need to package and deploy extra services on each controller
node: the mistral service and the pacemaker resource agent.

Developer impact
----------------

Nothing other than the work items listed below.

Implementation
==============

The resource agent would receive information from the host monitor that a
given host is down. It would then send a request to mistral to start the
recovery workflow. The request needs the input parameters below, where
``host`` is the name of the failed compute host and ``on_shared_storage`` is
either ``true`` or ``false``:

.. code-block:: json

   {
       "search_opts": {
           "host": "COMPUTE_NAME"
       },
       "on_shared_storage": true
   }
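
As an illustration only, the resource agent could assemble this workflow
input along the following lines. The helper name ``build_recovery_input`` and
the example host name ``compute-1`` are hypothetical, and how the input is
actually submitted to mistral is left out of this sketch.

```python
import json
from typing import Any, Dict


def build_recovery_input(host: str, on_shared_storage: bool) -> Dict[str, Any]:
    """Build the recovery workflow input for a failed compute host.

    Hypothetical helper: only the dictionary layout comes from the spec.
    """
    if not host:
        raise ValueError("host must be a non-empty compute host name")
    return {
        "search_opts": {"host": host},
        "on_shared_storage": bool(on_shared_storage),
    }


if __name__ == "__main__":
    # "compute-1" is a made-up host name used only for this example.
    print(json.dumps(build_recovery_input("compute-1", True), indent=4))
```

Serializing with ``json.dumps`` guarantees that ``on_shared_storage`` is
emitted as a JSON boolean rather than a string.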

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

* Prepare resource agent that would trigger mistral
* Prepare mistral workflow
* Document changes in HA guide

Dependencies
============

Host monitor

Testing
=======

Documentation Impact
====================

The service should be documented in the ha-guide.

References
==========

- `Instance HA etherpad started at Newton Design Summit in Austin
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
  <https://youtu.be/lddtWUP_IKQ>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- `Instance auto-evacuation cross project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_

History
=======