Spec for host recovery
Change-Id: Ifa2583901cd2dff0b450d81fd7de96b27e9c315a
parent 468d526263
commit e243a2c545

specs/newton/approved/newton-instance-ha-host-recovery.rst (normal file, 161 lines)

@@ -0,0 +1,161 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============
Host Recovery
=============

This spec describes a method for recovering all virtual machines that were
running on a host when that host failed.

Problem description
===================

When an entire compute node fails, recovering its instances is crucial for
providing high availability for virtual machines. On the other hand,
automatically recovering some instances may cause more problems than the
sudden shutdown itself.

Taking both points into account, automatic recovery needs to be configurable
at both the instance level and the host level. This spec describes the
possible actions when a compute node fails, and how they are configured.
Automatic recovery of particular instances is out of scope for this spec and
will be described in another document.

Use Cases
---------

* As a cloud operator, I would like to provide my users with highly
  available VMs to meet high SLA requirements. Therefore, I need some of my
  VMs to be recovered automatically after a compute node failure.

Proposed change
===============

VM recovery can be performed on the control plane of an OpenStack cloud,
using the mistral workflow service and a pacemaker resource agent. The
resource agent would be responsible for starting the workflow, whereas
mistral would be responsible for performing *nova_evacuate* for each VM and
for observing the state of each evacuated VM. Using mistral would ensure that
the evacuation workflow runs to completion even if some of the controllers
die during the process.
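
As a rough illustration of the per-VM logic the workflow would carry out, the
following Python sketch evacuates each instance and polls its state until it
settles. It is not the mistral workflow itself: ``evacuate_host``, the
``evacuate`` and ``get_status`` callables, the ``TIMEOUT`` marker and the
polling parameters are hypothetical names chosen for this example, and
treating ``ACTIVE`` and ``ERROR`` as the terminal states is an assumption.

.. code-block:: python

    import time
    from typing import Callable, Dict, Iterable


    def evacuate_host(
        server_ids: Iterable[str],
        evacuate: Callable[[str], None],
        get_status: Callable[[str], str],
        poll_interval: float = 5.0,
        max_polls: int = 60,
    ) -> Dict[str, str]:
        """Evacuate every server and watch it until its state settles.

        Returns a mapping of server id to the final observed status:
        "ACTIVE", "ERROR", or "TIMEOUT" if it never settled (assumed names).
        """
        results: Dict[str, str] = {}
        for server_id in server_ids:
            evacuate(server_id)             # request evacuation of one VM
            results[server_id] = "TIMEOUT"  # pessimistic default
            for _ in range(max_polls):
                status = get_status(server_id)
                if status in ("ACTIVE", "ERROR"):
                    results[server_id] = status
                    break
                time.sleep(poll_interval)
        return results

In a real deployment the two callables would wrap the corresponding nova
calls; passing them in keeps the sketch runnable without a cloud and makes
the loop easy to test with fakes.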

Alternatives
------------

1. We could skip the mistral workflow entirely and do all the
   *nova_evacuate* related work in the pacemaker resource agent. However,
   this would mean implementing all of the HA mechanism in the agent itself,
   which would be difficult.

2. We could try to implement a real *host-evacuate* in nova. Currently,
   *host-evacuate* iterates over all instances on the given host on the
   client side. We could move this logic into nova, but the nova cores have
   been against this change in the past.

Data model impact
-----------------

None

API impact
----------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Extra RAM and CPU would be needed on each controller node to run both the
pacemaker and mistral services. If they are already present on the control
plane, there would be no additional performance impact.

Other deployer impact
---------------------

Distributions need to package and deploy extra services on each controller
node: the mistral service and the pacemaker resource agent.

Developer impact
----------------

Nothing other than the work items listed below.

Implementation
==============

The resource agent would receive a notification from the host monitor that a
given host is down. It would then send a request to mistral to start the
recovery workflow. The request needs the following input parameters, where
``COMPUTE_NAME`` stands for the name of the failed compute host and
``on_shared_storage`` is either ``true`` or ``false``:

.. code-block:: json

    {
        "search_opts": {
            "host": "COMPUTE_NAME"
        },
        "on_shared_storage": true
    }
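
For illustration only, the resource agent could assemble this input along
the following lines. The helper name ``build_recovery_input`` and its
validation are assumptions made for this sketch, and how the resulting
document is handed to mistral is deliberately left out.

.. code-block:: python

    import json


    def build_recovery_input(compute_name: str, on_shared_storage: bool) -> str:
        """Serialize the workflow input for one failed compute host."""
        if not compute_name:
            raise ValueError("compute_name must be a non-empty host name")
        payload = {
            # Selects the instances that were running on the failed host.
            "search_opts": {"host": compute_name},
            # Tells the workflow whether instance disks are on shared storage.
            "on_shared_storage": bool(on_shared_storage),
        }
        return json.dumps(payload, sort_keys=True)

Building the input in one place keeps the two parameters named above
consistent between the agent and the workflow definition.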

Assignee(s)
-----------

Primary assignee:
  <launchpad-id or None>

Other contributors:
  <launchpad-id or None>

Work Items
----------

* Prepare a resource agent that triggers the mistral workflow
* Prepare the mistral workflow
* Document the changes in the HA guide

Dependencies
============

Host monitor

Testing
=======

Documentation Impact
====================

The service should be documented in the ha-guide.

References
==========

- `Instance HA etherpad started at Newton Design Summit in Austin
  <https://etherpad.openstack.org/p/newton-instance-ha>`_

- `"High Availability for Virtual Machines" user story
  <http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html>`_

- `Video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
  <https://youtu.be/lddtWUP_IKQ>`_

- `automatic-evacuation etherpad
  <https://etherpad.openstack.org/p/automatic-evacuation>`_

- `Instance auto-evacuation cross project spec (WIP)
  <https://review.openstack.org/#/c/257809>`_


History
=======