VM Monitoring
The purpose of this spec is to describe a method for monitoring the health of OpenStack VM instances without access to the VMs' internals.
Problem description
Monitoring VM health is essential for providing high availability for the VMs. Typically, cloud operators cannot look inside VMs in order to monitor their health, because doing so would violate the contract between cloud operators and users: users have complete autonomy over the contents of their VMs and over all actions performed inside them. Operators cannot assume any knowledge of the software stack inside the VM or make any changes to it. Therefore, VM health monitoring must be done externally. This VM monitor must be able to detect VM crashes, hangs (e.g. due to I/O errors) and so on.
Use Cases
As a cloud operator, I would like to provide my users with highly available VMs in order to meet high SLA requirements. Therefore, I need my VMs to be automatically monitored for sudden stops, crashes, I/O failures and similar problems. Any VM failure event detected needs to be passed to a VM recovery workflow service, which takes the appropriate actions to recover the VM. For example:
- If a VM crashes, the recovery service will try to restart it, possibly on the same host at first, and then on a different host if it fails to restart or if it restarts successfully but then crashes a second time on the original host.
- If a VM receives an I/O error, the recovery service may prefer to immediately contact nova-api to centrally disable the nova-compute service on that host (so that no new VMs are scheduled on the host) and restart the VM on a different host. It could also potentially live-migrate all other VMs off that host, in order to pre-empt any further I/O errors. (A rough sketch of such a recovery interaction follows this list.)
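The recovery workflow service itself is outside the scope of this spec, but as a purely illustrative sketch of the use case above, it might react to an I/O error event roughly as follows. The function name, the use of python-novaclient, and the evacuation step are assumptions made for this example only, not a defined design:

# Hypothetical sketch of what a recovery workflow service might do on
# receiving an I/O error event for a VM. The session argument is assumed
# to be an authenticated keystoneauth1 session.
from novaclient import client as nova_client

def handle_io_error(session, failed_host, instance_uuid):
    nova = nova_client.Client("2.1", session=session)
    # Stop scheduling new VMs onto the failing host.
    nova.services.disable(failed_host, "nova-compute")
    # Restart the affected VM elsewhere; evacuation assumes the source host
    # has already been fenced or marked down. Other VMs on the host could
    # additionally be live-migrated off as a precaution.
    nova.servers.evacuate(instance_uuid)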
Proposed change
VM monitoring can be done at the hypervisor level without accessing the inside of the VMs. In particular, libvirt provides a mechanism for monitoring its event stream via an event loop. We need to filter the required events and pass them to a recovery workflow service. In order to eliminate redundancy and improve extensibility, these event filters must be configurable.
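As a rough illustration of the mechanism (not a final implementation), the following sketch uses the libvirt Python bindings to register for lifecycle and I/O error events, apply a simple filter, and hand matching events to a forwarder. The forward_event() helper and the particular set of filtered events are assumptions made purely for this example:

# Sketch only: watch libvirt domain lifecycle and I/O error events and
# hand the interesting ones to a (hypothetical) forwarder.
import libvirt

# Events worth forwarding; in the real service this set would be configurable.
INTERESTING_LIFECYCLE_EVENTS = {
    libvirt.VIR_DOMAIN_EVENT_STOPPED,
    libvirt.VIR_DOMAIN_EVENT_CRASHED,
}

def forward_event(payload):
    # Placeholder: in the real monitor this would be sent to the recovery
    # workflow service over HTTP(S).
    print("would forward:", payload)

def lifecycle_cb(conn, dom, event, detail, opaque):
    if event in INTERESTING_LIFECYCLE_EVENTS:
        forward_event({"uuid": dom.UUIDString(),
                       "event": "LIFECYCLE",
                       "detail": (event, detail)})

def io_error_cb(conn, dom, src_path, dev_alias, action, opaque):
    forward_event({"uuid": dom.UUIDString(),
                   "event": "IO_ERROR",
                   "detail": (src_path, dev_alias, action)})

def main():
    # The default event loop implementation must be registered before the
    # connection is opened, then driven continuously.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.openReadOnly("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR, io_error_cb, None)
    while True:
        libvirt.virEventRunDefaultImpl()

if __name__ == "__main__":
    main()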
Potential advantages:
- Catching events at their source (the hypervisor layer) means that we don't have to rely on nova having knowledge of those events. For example, libvirtd can output errors when a VM's I/O layer encounters issues, but nova doesn't emit corresponding events for this.
- It should be relatively easy to support a configurable event filter.
- The VM instance monitor can be run on each compute node, so it should scale well as the number of compute nodes increases.
- The VM instance monitors could be managed by pacemaker_remote via a new OCF RA (resource agent).
Alternatives
There are three alternatives to the proposed change:
1. Listen for VM status change events on the message queue.
Potential disadvantages:
- It might be less reliable if, for some reason, the message queue introduced latency or was lossy.
- There might also be some gaps in which events are propagated to the queue; if so, we could submit a nova spec to plug the gaps.
- If we listen for events from the control plane, it won't scale as well to large numbers of compute nodes, and it would then be awkward to trigger recovery via Pacemaker.
2. Write a new nova-libvirt OCF RA.
It would compare nova's expectations of which VMs should be running on the compute node with the reality. Any differences between the two would result in appropriate failure events being sent to the recovery workflow service.
Potential disadvantages:
- This is more complexity than is expected to run inside an RA. RAs are supposed to be lightweight components which simply start, stop, and monitor services, whereas this would require abusing that model by pretending there is a separate monitoring service when there isn't. The monitor action would need to fail whenever any of the differences mentioned above were detected, and then the stop or start action would need to send the failure events.
- Within this "fake service" model, it's not clear how to avoid sending the same failure events over and over again until the failures were corrected.
- Typically RAs are implemented in bash. This is not a hard requirement, but something of this complexity would be much better coded in Python, resulting in a mix of languages within the openstack-resource-agents repository.
3. Same as 2. above, but as part of the NovaCompute RA.
- This has all the disadvantages of 2., but even more so, since new functionality would have to be mixed alongside the existing NovaCompute functionality.
Data model impact
None
API impact
The HTTP API of the VM recovery workflow service needs to be able to receive events in the format they are sent by this instance monitor.
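The exact wire format is still to be decided (see the FIXME under Implementation below). Purely as an illustration of what the recovery workflow service might need to accept, the monitor could POST one small JSON document per event, along the following lines; the endpoint URL, field names and certificate path are assumptions for the sake of the example, not a defined API:

# Illustrative only: one possible shape for an event notification.
# The URL, field names and certificate path are assumptions.
import datetime
import requests

def notify_recovery_service(instance_uuid, event_type, detail):
    payload = {
        "type": "VM",
        "hostname": "compute-01.example.com",  # compute node reporting the event
        "generated_time": datetime.datetime.utcnow().isoformat() + "Z",
        "payload": {
            "instance_uuid": instance_uuid,
            "event": event_type,               # e.g. "LIFECYCLE" or "IO_ERROR"
            "detail": detail,
        },
    }
    # Verifying the server certificate against a CA bundle ties in with the
    # Security impact section: don't rely solely on the admin network.
    resp = requests.post("https://recovery.example.com/notifications",
                         json=payload,
                         verify="/etc/ssl/certs/ca-certificates.crt",
                         timeout=10)
    resp.raise_for_status()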
Security impact
Ideally it should be possible for the instance monitor to send instance event data securely to the recovery workflow service (e.g. via TLS), without relying on the security of the admin network over which the data is sent.
Other end user impact
None
Performance Impact
There will be a small amount of extra RAM and CPU required on each compute node for running the instance monitor. However, it is a relatively simple service, so this should not have a significant impact on the node.
Other deployer impact
Distributions need to package and deploy an extra service on each compute node. However the existing instance monitor implementation in masakari already provides files to simplify packaging on the Linux distributions most commonly used for OpenStack infrastructure.
Developer impact
Nothing other than the listed work items below.
Implementation
libvirtd uses QMP (QEMU Machine Protocol) over a UNIX domain socket (/var/lib/libvirt/qemu/xxxx.monitor) to communicate with the VM domain. libvirt catches the failure events and passes them to the VM monitor. The VM monitor filters the events and passes them to an external recovery workflow via HTTP, which then takes the action required to recover the VM.
+-----------------------+
| +----------------+ |
| | VM | |
| | (qemu Process) | |
| +---------^------+ |
| | |QMP |
| +-----v----------+ |
| | libvirtd | |
| +---------^------+ |
| | | |
| +-----v----------+ | +-----------------------+
| | VM Monitor +------------>+ VM recovery workflow |
| +----------------+ | +-----------------------+
| |
| Compute Node |
+-----------------------+
We can almost certainly reuse the instance monitor provided by masakari.
FIXME:
- Need to detail how and in which format the event data should be sent over HTTP. This should allow support for other hypervisors not based on libvirt to be added in the future.
- Need to give details of the exact ways in which the service can be configured:
  - How should event filtering be configurable? (One purely illustrative possibility is sketched after this list.)
  - Where should the configuration live? With masakari, it lives in /etc/masakari-instancemonitor.conf.
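As one purely illustrative answer to the event-filtering question above, the filter could be expressed as a comma-separated option in an INI-style file in the spirit of /etc/masakari-instancemonitor.conf. The section and option names below are invented for this sketch and are not an agreed format:

# Hypothetical config-handling sketch; the section and option names are
# invented for illustration only.
import configparser

EXAMPLE_CONF = """
[filters]
# libvirt event types (by name) to forward to the recovery workflow
# service; anything else is dropped by the monitor.
forward_events = LIFECYCLE, IO_ERROR, WATCHDOG
"""

def load_event_filter(path=None):
    parser = configparser.ConfigParser()
    if path:
        parser.read(path)
    else:
        parser.read_string(EXAMPLE_CONF)
    raw = parser.get("filters", "forward_events", fallback="")
    return {name.strip() for name in raw.split(",") if name.strip()}

print(load_event_filter())  # -> {'LIFECYCLE', 'IO_ERROR', 'WATCHDOG'}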
Assignee(s)
- Primary assignee: <launchpad-id or None>
- Other contributors: <launchpad-id or None>
Work Items
- Package masakari's instance monitor for SLES (aspiers)
- Add documentation to the OpenStack High Availability Guide (beekhof)
- Look into libvirt-test-API
- Write test suite
Dependencies
Testing
It may be possible to write a test suite using libvirt-test-API or at least some of its components.
Documentation Impact
The service should be documented in the OpenStack High Availability Guide.
References
- Instance HA etherpad started at Newton Design Summit in Austin
- "High Availability for Virtual Machines" user story
- video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin
- automatic-evacuation etherpad
- Instance auto-evacuation cross project spec (WIP)
History
Release Name | Description
---|---
Newton | Introduced