Host Recovery
This spec describes a method for recovering all virtual machines that were running on a host after that host fails.
Problem description
When a whole compute node fails, recovering its instances is crucial for providing high availability for virtual machines. On the other hand, automatically recovering some instances may cause more problems than the fact that they were suddenly turned off.
Taking both arguments into account, automatic recovery is needed, and it has to be configurable at both the instance level and the host level. This spec describes the possible actions on compute node failure and how they are configured. Automatic recovery of particular instances is out of scope for this spec and will be described in a separate document.
Use Cases
* As a cloud operator, I would like to provide my users with highly available VMs to meet high SLA requirements. Therefore, I need some of my VMs to be recovered automatically after a compute node failure.
Proposed change
VM recovery can be performed on the control plane of an OpenStack cloud. It would be done using the Mistral workflow service and a Pacemaker resource agent. The resource agent would be responsible for starting the workflow, whereas Mistral would be responsible for performing nova_evacuate for each VM and for observing the state of each evacuated VM. Using Mistral would ensure that the evacuation workflow runs to completion even if some of the controllers die during the process.
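The control flow described above (evacuate each VM, then observe it until it settles) can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed Mistral workflow itself: `recover_host`, the injected `evacuate` and `get_state` callables, the timeout values, and the choice of `ACTIVE` and `ERROR` as terminal states are all hypothetical names introduced for this example.

```python
import time
from typing import Callable, Dict, Iterable, List


def recover_host(
    instance_ids: Iterable[str],
    evacuate: Callable[[str], None],    # hypothetical: issues one evacuate request
    get_state: Callable[[str], str],    # hypothetical: returns the current VM state
    timeout_s: float = 300.0,           # assumed value, not from the spec
    poll_interval_s: float = 5.0,       # assumed value, not from the spec
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Evacuate every instance, then poll each one until it settles."""
    results: Dict[str, str] = {}
    pending: List[str] = []

    # Step 1: request evacuation of every instance; one failure must not
    # prevent the remaining instances from being evacuated.
    for instance_id in instance_ids:
        try:
            evacuate(instance_id)
            pending.append(instance_id)
        except Exception as exc:
            results[instance_id] = f"evacuate failed: {exc}"

    # Step 2: observe the evacuated instances until each reaches a terminal
    # state (assumed here to be ACTIVE or ERROR) or the deadline passes.
    deadline = clock() + timeout_s
    while pending and clock() < deadline:
        still_pending: List[str] = []
        for instance_id in pending:
            state = get_state(instance_id)
            if state in ("ACTIVE", "ERROR"):
                results[instance_id] = state
            else:
                still_pending.append(instance_id)
        pending = still_pending
        if pending:
            sleep(poll_interval_s)

    # Anything still unsettled at the deadline is reported as a timeout.
    for instance_id in pending:
        results[instance_id] = "TIMEOUT"
    return results
```

In the proposed design, these two steps would be tasks in a workflow whose state is held by the workflow service, not a loop in a single process. That is what would let the evacuation finish if a controller dies partway through.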
Alternatives
1. We could skip the Mistral workflow entirely and do all nova_evacuate related work in the Pacemaker resource agents. However, this would mean implementing the whole HA mechanism in the resource agents, which would be difficult.
2. We could try to implement a real host-evacuate in Nova. Right now host-evacuate iterates over all instances on the given host on the client side. We could try to move this into Nova, but the Nova core reviewers have been against this change in the past.
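To make the second alternative concrete, the client-side iteration can be sketched as below. This is a simplified illustration, not the actual client code: `host_evacuate`, `list_instances_on_host`, and `evacuate` are hypothetical names, and the two callables stand in for whatever API requests the client makes.

```python
from typing import Callable, Dict, List, Optional


def host_evacuate(
    host: str,
    list_instances_on_host: Callable[[str], List[str]],  # hypothetical lookup
    evacuate: Callable[[str], None],                      # hypothetical request
) -> Dict[str, Optional[str]]:
    """Client-side loop: one evacuate request per instance on the host."""
    outcome: Dict[str, Optional[str]] = {}
    for instance_id in list_instances_on_host(host):
        try:
            evacuate(instance_id)
            outcome[instance_id] = None      # request accepted
        except Exception as exc:
            outcome[instance_id] = str(exc)  # record the error and keep going
    return outcome
```

Because the loop lives in the client, its progress is lost if the client process stops partway through, and the remaining instances are never requested.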
Data model impact
None
API impact
None
Security impact
None
Other end user impact
None
Performance Impact
Extra RAM and CPU would be needed on each controller node to run both the Pacemaker and Mistral services. If they are already present on the control plane, there would be no performance impact.
Other deployer impact
Distributions need to package and deploy extra services on each controller node: the Mistral service and the Pacemaker resource agent.
Developer impact
Nothing other than the listed work items below.
Implementation
The resource agent would receive a notification from the host monitor that a given host is down. It would then send a request to Mistral to start the recovery workflow. The request needs the following input parameters:
Assignee(s)
- Primary assignee: <launchpad-id or None>
- Other contributors: <launchpad-id or None>
Work Items
- Prepare resource agent that would trigger mistral
- Prepare mistral workflow
- Document changes in HA guide
Dependencies
Host monitor
Testing
Documentation Impact
The service should be documented in the ha-guide.
References
- Instance HA etherpad started at Newton Design Summit in Austin
- "High Availability for Virtual Machines" user story
- video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
- automatic-evacuation etherpad
- Instance auto-evacuation cross project spec (WIP)