Host Monitoring
The purpose of this spec is to describe a method for monitoring the health of OpenStack compute nodes.
Problem description
Monitoring compute node health is essential for providing high availability for VMs. A health monitor must be able to detect crashes, freezes, network connectivity issues, and any other OS-level errors on the compute node which prevent it from being able to run the necessary services in order to host existing or new VMs.
Use Cases
As a cloud operator, I would like to provide my users with highly available VMs to meet high SLA requirements. Therefore, I need my compute nodes automatically monitored for hardware failure, kernel crashes and hangs, and other failures at the operating system level. Any failure event detected needs to be passed to a compute host recovery workflow service which can then take the appropriate remedial action.
For example, if a compute host fails (or appears to fail to the extent that the monitor can detect), the recovery service will typically identify all VMs which were running on this compute host, and may take any of the following possible actions:
- Fence the host (STONITH) to eliminate the risk of a still-running instance being resurrected elsewhere (see the next step) and simultaneously running in two places as a result, which could cause data corruption.
- Resurrect some or all of the VMs on other compute hosts.
- Notify the cloud operator.
- Notify affected users.
- Make the failure and recovery events available to telemetry / auditing systems.
Scope
This spec only addresses monitoring the health of the compute node hardware and basic operating system functions, and notifying appropriate recovery components in the case of any failure.
Monitoring the health of nova-compute and other processes it depends on (such as libvirtd), and of anything else at or above the hypervisor layer, including individual VMs, will be covered by separate specs and is therefore out of scope for this spec.
Any kind of recovery workflow is also out of scope and will be covered by separate specs.
This spec has the following goals:
- Encourage all implementations of compute node monitoring, whether upstream or downstream, to output failure notifications in a standardized manner. This will allow cloud vendors and operators to implement HA of the compute plane via a collection of compatible components (of which one is compute node monitoring), whilst not being tied to any one implementation.
- Provide details of and recommend a specific implementation which for the most part already exists and is proven to work.
- Identify gaps with that implementation and corresponding future work required.
Acceptance criteria
Here the key words "must", "should", etc. are used with the strict meanings defined in RFC 2119.
- Compute nodes must be automatically monitored for hardware failure, kernel crashes and hangs, and other failures at the operating system level.
- The solution must scale to hundreds of compute hosts.
- Any failure event detected must cause the component responsible for alerting to send a notification to a configurable endpoint so that it can be consumed by the cloud operator's choice of compute node recovery workflow controller.
- If a failure notification is not accepted by the recovery component, it should be persisted within the monitoring/alerting components, and sending of the notification should be retried periodically until it succeeds. This will ensure that remediation of failures is never dropped due to temporary failure or other unavailability of any component. (A sketch of this retry behaviour is given after this list.)
- The alerting component must be extensible in order to allow communication with multiple types of recovery workflow controller, via a driver abstraction layer, and drivers for each type. At least one driver must be implemented initially.
- One of the drivers should send notifications to an HTTP endpoint using a standardized JSON format as the payload.
- Another driver should send notifications to the masakari API server.
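By way of illustration, the following Python sketch shows one way the alerting component could satisfy the persistence and retry criteria above. The spool directory path, retry interval, and send() interface are all assumptions for illustration, not part of this spec:

    # Illustrative only: a minimal retry loop for the criterion above.
    # SPOOL_DIR, RETRY_INTERVAL and the send() callable are hypothetical.
    import json
    import time
    from pathlib import Path

    SPOOL_DIR = Path("/var/lib/nova-host-alerter/pending")  # hypothetical path
    RETRY_INTERVAL = 60  # seconds; hypothetical default

    def retry_pending(send):
        """Retry queued notifications until each one is accepted.

        send is any callable which delivers one notification dict to the
        recovery workflow controller and returns True on acceptance.
        """
        while True:
            for entry in sorted(SPOOL_DIR.glob("*.json")):
                notification = json.loads(entry.read_text())
                try:
                    if send(notification):
                        entry.unlink()  # accepted: remove from the spool
                except Exception:
                    pass  # leave the file in place; retried next iteration
            time.sleep(RETRY_INTERVAL)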
Implementation
The implementation described here was presented at OpenStack Day Israel in June 2017; the architecture diagram from that talk should assist in understanding the description below.
Running a pacemaker_remote service on each compute host allows it to be monitored by a central Pacemaker cluster via a straightforward TCP connection. This is an ideal solution to the problem for the following reasons:
- Pacemaker can scale to handling a very large number of remote nodes.
- pacemaker_remote can be used simultaneously for monitoring and for managing services on each compute host.
- pacemaker_remote is a very lightweight service which will not cause any significant increase in load on each compute host.
- Pacemaker has excellent fencing support for a wide range of STONITH devices, and it is easy to extend support to other devices, as shown by the fence-agents repository.
- Pacemaker is easily extensible via OCF Resource Agents, which allow custom design of monitoring and of the automated reaction when those monitors fail.
- Many clouds will already be running one or more Pacemaker clusters on the control plane, as recommended by the OpenStack High Availability Guide_, so deployment complexity is not significantly increased.
- This architecture is already implemented and proven via the commercially supported enterprise products RHEL OpenStack Platform and SUSE OpenStack Cloud, and via masakari which is used by production deployments at NTT.
Since many different tools are currently in use for deploying OpenStack with HA, configuration of Pacemaker is out of scope for upstream projects, and the exact details are left to each individual deployer. Nevertheless, examples of partial Pacemaker configurations are given below.
Fencing
Fencing is technically outside the scope of this spec, in order to allow any cloud operator to choose their own clustering technology whilst remaining compliant with, and hence compatible with, the notification standard described here. However, Pacemaker offers a particularly convenient solution to fencing, one which is also used to trigger the failure notification, so it is described here in full.
Pacemaker already implements effective heartbeat monitoring of its remote nodes via the TCP connection with pacemaker_remote, so it only remains to ensure that the correct steps are taken when the monitor detects a failure:
- Firstly, the compute host must be fenced via an appropriate STONITH agent, for the reasons stated above.
- Once the host has been fenced, the monitor must mark the host as needing remediation in a manner which is persisted to disk (in case of changes in cluster state during handling of the failure) and read/write-accessible by a separate alerting component which can hand over responsibility of processing the failure to a recovery workflow controller, by sending it the appropriate notification.
These steps should be implemented using two features of Pacemaker. Firstly, its fencing_topology configuration directive can be used to implement the second step as a custom fencing agent which is triggered after the first step is complete. For example, the custom fencing agent might be set up via a Pacemaker primitive resource such as:
    primitive fence-nova stonith:fence_compute \
        params auth-url="http://cluster.my.cloud.com:5000/v3/" \
        domain=my.cloud.com \
        tenant-name=admin \
        endpoint-type=internalURL \
        login=admin \
        passwd=s3kr1t \
        op monitor interval=10m
and then it could be configured as the second device in the fencing sequence:
    fencing_topology compute1: stonith-compute1,fence-nova
Secondly, the fence_compute agent here should persist the marking of the fenced compute host via attrd, so that a separate alerting component can transfer ownership of this host's failure to a recovery workflow controller by sending it the appropriate notification message.
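For illustration, the fencing agent could persist this marker using Pacemaker's attrd_updater CLI, for example via a small Python wrapper like the following. The attribute name "evacuate" is illustrative, chosen to resemble the marker used by the existing fence_compute agent:

    # Illustrative only: marking a fenced host in attrd via Pacemaker's
    # attrd_updater CLI. The attribute name "evacuate" is illustrative.
    import subprocess
    import time

    def mark_host_for_evacuation(hostname):
        """Persist a 'needs remediation' marker for the fenced host."""
        subprocess.run(
            ["attrd_updater", "--name", "evacuate",
             "--update", str(time.time()),  # when the failure occurred
             "--node", hostname],
            check=True,
        )

    def clear_evacuation_marker(hostname):
        """Called once a recovery controller has taken ownership."""
        subprocess.run(
            ["attrd_updater", "--name", "evacuate",
             "--delete", "--node", hostname],
            check=True,
        )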
It is worth noting that the fence_compute fencing agent already exists as part of an earlier architecture, so it is strongly recommended to reuse and adapt the existing implementation rather than writing a new one from scratch.
Sending failure notifications to a host recovery workflow controller
There must be a highly available service responsible for taking host failures marked in attrd, notifying a recovery workflow controller, and updating attrd accordingly once appropriate action has been taken. A suggested name for this service is nova-host-alerter.
It should be easy to ensure this alerter service is highly available by placing it under the management of the existing Pacemaker cluster. It could be written as an OCF resource agent, or as a Python daemon which is controlled by an OCF / LSB / systemd resource agent.
The alerter service must contain an extensible driver-based architecture, so that it is capable of sending notifications to a number of different recovery workflow controllers.
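As an illustration of this driver-based architecture, the following Python sketch shows one possible shape for the abstraction; the class and method names are hypothetical, not mandated by this spec:

    # Illustrative only: one possible shape for the driver abstraction.
    # The class and method names are hypothetical.
    import abc

    class NotificationDriver(abc.ABC):
        """One driver per type of recovery workflow controller."""

        @abc.abstractmethod
        def send(self, notification):
            """Deliver one host-failure notification.

            Returns True if the controller accepted it, so the alerter
            knows whether it needs to retry later.
            """

    class HTTPDriver(NotificationDriver):
        def send(self, notification):
            ...  # POST JSON to a configurable endpoint (see below)

    class MasakariDriver(NotificationDriver):
        def send(self, notification):
            ...  # call the masakari API (see below)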
In particular it must have a driver for sending notifications via the masakari API. If the service is implemented as a shell script, this could be achieved by invoking masakari's notification-create CLI, or if in Python, via the python-masakariclient library.
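For illustration, a masakari driver might shell out to that CLI roughly as follows. The positional argument order (type, hostname, generated_time, payload) and the payload fields shown are assumptions which should be checked against the masakari client documentation:

    # Illustrative only: shelling out to masakari's notification-create CLI.
    # Argument order and payload fields are assumptions, not confirmed here.
    import json
    import subprocess

    def notify_masakari(hostname, generated_time):
        payload = {"event": "STOPPED", "host_status": "NORMAL",
                   "cluster_status": "OFFLINE"}  # illustrative fields
        subprocess.run(
            ["masakari", "notification-create",
             "COMPUTE_HOST",  # assumed notification type for host failures
             hostname,
             generated_time,
             json.dumps(payload)],
            check=True,
        )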
Ideally it should also have a driver for sending HTTP POST messages to a configurable endpoint with JSON data formatted in the following form:
    {
        "id": UUID,
        "event_type": "host failure",
        "version": "1.0",
        "generated_time": TIMESTAMP,
        "payload": {
            "hostname": COMPUTE_NAME,
            "on_shared_storage": [true|false],
            "failure_time": TIMESTAMP
        }
    }
COMPUTE_NAME refers to the FQDN of the compute node on which the failures have occurred. on_shared_storage is true if and only if the compute host's instances are backed by shared storage. failure_time provides a timestamp (in seconds since the UNIX epoch) for when the failure occurred.
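For illustration, an HTTP driver could construct and POST this payload using only the Python standard library. The endpoint URL shown is hypothetical and would come from the alerter's configuration; the format of generated_time is not pinned down above, so epoch seconds (matching failure_time) is assumed here:

    # Illustrative only: constructing and POSTing the payload defined above
    # using the standard library. Endpoint URL and generated_time format
    # are assumptions.
    import json
    import time
    import urllib.request
    import uuid

    def post_host_failure(endpoint, hostname, failure_time, on_shared_storage):
        body = {
            "id": str(uuid.uuid4()),
            "event_type": "host failure",
            "version": "1.0",
            "generated_time": time.time(),  # assumed epoch seconds
            "payload": {
                "hostname": hostname,
                "on_shared_storage": on_shared_storage,
                "failure_time": failure_time,
            },
        }
        request = urllib.request.Request(
            endpoint,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return 200 <= response.status < 300  # accepted by the controller

    # Example usage (hypothetical endpoint; https gives the TLS protection
    # suggested under "Security impact" below):
    # post_host_failure("https://recovery.my.cloud.com/notifications",
    #                   "compute1.my.cloud.com", time.time(), True)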
This is already implemented as fence_evacuate.py, although the message sent by that script is currently specifically formatted to be consumed by Mistral.
Alternatives
No alternatives to the overall architecture are obviously apparent at this point. However, it is possible that attrd (which is functional but not comprehensively documented) could be replaced by some other highly available key/value attribute store, such as etcd.
Impact assessment
Data model impact
None
API impact
The HTTP API of the host recovery workflow service needs to be able to receive events in the format they are sent by this host monitor.
Security impact
Ideally it should be possible for the host monitor to send instance event data securely to the recovery workflow service (e.g. via TLS), without relying on the security of the admin network over which the data is sent.
Other end user impact
None
Performance Impact
There will be a small amount of extra RAM and CPU required on each compute node for running the pacemaker_remote service. However, it is a relatively simple service, so this should not have a significant impact on the node.
Other deployer impact
Distributions need to package pacemaker_remote; however, this is already done for many distributions including SLES, openSUSE, RHEL, CentOS, Fedora, Ubuntu, and Debian.
Automated deployment solutions need to deploy and configure the pacemaker_remote service on each compute node; however, this is a relatively simple task.
Developer impact
Nothing other than the listed work items below.
Documentation Impact
The service should be documented in the OpenStack High Availability Guide_.
Assignee(s)
Primary assignee:
- Adam Spiers
Other contributors:
- Sampath Priyankara
- Andrew Beekhof
- Dawid Deja
Work Items
- Implement nova-host-alerter (TODO: choose owner for this)
- If appropriate, move the existing fence_evacuate.py to a more suitable long-term home (TODO: choose owner for this)
- Add SSL support (TODO: choose owner for this)
- Add documentation to the OpenStack High Availability Guide_ (aspiers / beekhof)
Dependencies
Testing
Cloud99 could possibly be used for testing.
References
- Architecture diagram presented at OpenStack Day Israel, June 2017 (see also the video of the talk)
- "High Availability for Virtual Machines" user story
- Video of "High Availability for Instances: Moving to a Converged Upstream Solution" presentation at OpenStack conference in Boston, May 2017
- Instance HA etherpad started at Newton Design Summit in Austin, April 2016
- Video of "HA for Pets and Hypervisors" presentation at OpenStack conference in Austin, April 2016
- automatic-evacuation etherpad
- Existing fence agent which sends failure notification payload as JSON over HTTP.
- Instance auto-evacuation cross project spec (WIP)
History
| Release Name | Description |
|---|---|
| Pike | Updated to have alerting mechanism decoupled from fencing process |
| Newton | First introduced |