397e40a89f
Update misc compute references to worker Tests Performed: Non-containerized deployment AIO-SX: Sanity and Nightly automated test suite AIO-DX: Sanity and Nightly automated test suite 2+2 System: Sanity and Nightly automated test suite 2+2 System: Horizon Patch Orchestration Kubernetes deployment: AIO-SX: Create, delete, reboot and rebuild instances 2+2+2 System: worker nodes are unlock enable and no alarms Story: 2004022 Task: 27013 Depends-On: https://review.openstack.org/#/c/624452/ Change-Id: I7b2de5c7202f2e86b55f5343c8d79d50e3a072a2 Signed-off-by: Tao Liu <tao.liu@windriver.com>
386 lines
19 KiB
Plaintext
Executable File
Copyright(c) 2013-2017, Wind River Systems, Inc.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
  * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
  * Neither the name of Wind River Systems nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-----------------------------------------------------------------------

This file contains instructions for using the Titanium Cloud Guest-Client.


Titanium Cloud Setup
====================
The following steps are required to set up the Titanium Cloud to heartbeat
a VM.

  1. Create and modify a Flavor for your VM.

     A flavor extraspec, 'Guest Heartbeat', is used to indicate
     that VMs of this flavor support Titanium Cloud Guest Heartbeat.
     The default value is 'False'.

     If support is indicated, then as soon as the VM's Titanium Cloud
     Guest-Client daemon registers with the Titanium Cloud Compute
     Services on the worker node host, heartbeating will be enabled.

     a) Create a new flavor:

        via dashboard ...
           - Select 'Admin->Flavors' to bring up the list of flavors
           - Select '+ Create Flavor' in the upper right
           - Fill in the fields as desired
           - Select 'Create Flavor'

        via command line ...
           - nova flavor-create ...

     b) Modify the newly created flavor or an existing flavor:

        via dashboard ...
           - Select 'Admin->Flavors' to bring up the list of flavors
           - Choose a flavor to modify
           - Select the <flavor-name> to go to the Flavor Detail page
           - Select the Extra Specs tab
           - Select '+ Create'
           - Select 'Guest Heartbeat' from the pull-down Extra Spec menu
           - Check the 'Guest Heartbeat' checkbox
           - Select 'Create'

        via command line ...
           - nova flavor-key <flavor-name> set sw:wrs:guest:heartbeat=True

     Note: instances that are already running and were launched with this
           flavor are NOT affected.

  2. Launch a new instance of your VM.

  3. Verify your VM is running with Guest Heartbeat enabled.

     Log into the VM.

     Guest-Client logs are written to syslog's 'daemon' facility, which
     the syslog service typically writes to /var/log/daemon.log.
     Refer to your syslog configuration for details on log settings in
     order to determine the location of logged Guest-Client messages.

     Guest-Client logs are easy to identify. The logs always contain the
     string 'Guest-Client'. A recursive grep of /var/log is one way to
     determine where your syslog is sending the Guest-Client logs.

        LOG=`grep -r -l 'Guest-Client' /var/log`
        echo $LOG

        /var/log/daemon.log

     A successful connection can be verified by looking for the
     following log.

        grep "Guest-Client" $LOG | grep "heartbeat state change"

        Guest-Client heartbeat state change from enabling to enabled


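The verification in step 3 can be wrapped in a small shell function. This is
only a sketch: heartbeat_enabled is a hypothetical helper name, and it assumes
the state-change message appears in the log exactly as quoted above.

```shell
#!/bin/sh
# heartbeat_enabled [LOG_DIR]
# Returns 0 if any file under LOG_DIR (default /var/log) contains the
# "enabling to enabled" state-change message, non-zero otherwise.
# Assumption: the message text matches the sample log line in this README.
heartbeat_enabled() {
    log_dir="${1:-/var/log}"
    # -r: search recursively, -q: print nothing and stop at the first match.
    # Errors from unreadable files are discarded.
    grep -r -q 'Guest-Client heartbeat state change from enabling to enabled' \
        "$log_dir" 2>/dev/null
}
```

For example, "heartbeat_enabled && echo enabled" prints "enabled" once the
Guest-Client has connected and the message has reached a file under /var/log.
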
VM Setup
========
Configuring Guest-Client Initialization/Start Scripts
-----------------------------------------------------
The Titanium Cloud communicates with the Guest-Client through a character
device. The packaged initialization/startup scripts need to be updated to
specify the character device exposed by QEMU to the VM.

                                                   +-- Virtual Machine ---+
                                                   |                      |
                                                   |                      |
    Titanium Cloud <-------------------> QEMU <------------> Guest-Client |
                     unix-stream-socket        char-device                |
                                                   |                      |
                                                   +----------------------+

The variable that needs updating in the initialization/start scripts is
called GUEST_CLIENT_DEVICE.

The location of the Guest-Client binary also needs to be updated in the
initialization/start scripts. The variable that needs updating is called
GUEST_CLIENT.


Configuring Guest Heartbeat & Application Health Check
------------------------------------------------------
The Guest-Client within your VM will register with the Titanium Cloud
Compute Services on the worker node host. Part of that registration
process is the specification of a heartbeat interval and a corrective
action for a failed/unhealthy VM. The values of the heartbeat interval and
corrective action come from the guest_heartbeat.conf file, which is located
in the /etc/guest-client/heartbeat directory by default.

Guest heartbeat works on a challenge-response model. The Titanium
Cloud Compute Services on the worker node host will challenge the
Guest-Client daemon with a message each interval. The Guest-Client
must respond prior to the next interval with a message indicating good
health. If the Titanium Cloud Compute Services does not receive a valid
response, or if the response specifies that the VM is in ill health, then
corrective action is taken.

The mechanism can be extended by allowing additional VM-resident,
application-specific scripts and processes to register for heartbeating.
Each script or process can specify its own heartbeat interval, and its own
corrective action to be taken against the VM as a whole. On ill health the
Guest-Client reports ill health to the Titanium Cloud Compute Services on
the worker node host on the next challenge, which provokes the corrective
action.

This mechanism allows for detection of a failed or hung QEMU/KVM instance,
a failure of the OS within the VM to schedule the Guest-Client process
or to route basic IO, or an application level error/failure.

Configuring the Guest-Client Heartbeat & Application Health Check ...

  The heartbeat interval defaults to one second (1000 milliseconds) and can
  be overridden by the VM in the guest_heartbeat.conf.

    /etc/guest-client/heartbeat/guest_heartbeat.conf:
    ## This specifies the interval between heartbeats in milliseconds between the
    ## guest-client heartbeat and the Titanium Cloud Compute Services on the
    ## worker node host.
    HB_INTERVAL=1000

  The corrective action defaults to 'reboot' and can be overridden by the
  VM in the guest_heartbeat.conf.

    /etc/guest-client/heartbeat/guest_heartbeat.conf:
    ## This specifies the corrective action against the VM in the case of a
    ## heartbeat failure between the guest-client and Titanium Cloud Compute
    ## Services on the worker node host and also when the health script
    ## configured below fails.
    ##
    ## Your options are:
    ##   "log"     Only a log is issued.
    ##   "reboot"  Issue a reboot against this VM.
    ##   "stop"    Issue a stop against this VM.
    ##
    CORRECTIVE_ACTION="reboot"

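Because the two settings above only accept a narrow range of values, it can be
worth checking an edited file before restarting the Guest-Client. The sketch
below is one way to do that. It assumes the file can be read as plain shell
NAME=value assignments, which matches the excerpts shown here but is not
stated by this README, and validate_heartbeat_conf is a hypothetical helper
name.

```shell
#!/bin/sh
# validate_heartbeat_conf FILE
# Prints "ok" and returns 0 when HB_INTERVAL is a positive whole number and
# CORRECTIVE_ACTION is one of log, reboot or stop. Otherwise prints the
# problem and returns 1.
# Assumption: FILE can be sourced as POSIX shell variable assignments.
validate_heartbeat_conf() (
    # The function body is a subshell, so sourced variables do not leak.
    HB_INTERVAL=1000            # documented default: one second
    CORRECTIVE_ACTION="reboot"  # documented default
    . "$1" || exit 1

    case "$HB_INTERVAL" in
        ''|*[!0-9]*)
            echo "HB_INTERVAL must be a whole number of milliseconds"
            exit 1 ;;
    esac
    if [ "$HB_INTERVAL" -le 0 ]; then
        echo "HB_INTERVAL must be greater than zero"
        exit 1
    fi

    case "$CORRECTIVE_ACTION" in
        log|reboot|stop) ;;
        *)
            echo "unknown CORRECTIVE_ACTION: $CORRECTIVE_ACTION"
            exit 1 ;;
    esac
    echo "ok"
)
```

Pass the full path of the file to check, for example
"validate_heartbeat_conf /etc/guest-client/heartbeat/guest_heartbeat.conf".
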
  A health check script can be registered to run periodically to verify
  the health of the VM. This is specified in the guest_heartbeat.conf.

    /etc/guest-client/heartbeat/guest_heartbeat.conf:
    ## The Path to the health check script. This is optional.
    ## The script will be called periodically to check for the health of the VM.
    ## The health check interval is specified in seconds.
    HEALTH_CHECK_INTERVAL=30
    HEALTH_CHECK_SCRIPT="/etc/guest-client/heartbeat/sample_health_check_script"


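A health check script could be as simple as the sketch below, which reports
ill health when the root filesystem is nearly full. This README does not
describe the script's calling convention, so the sketch assumes the same one
documented for the event notification script: exit status zero for healthy,
non-zero for unhealthy, with a one-line reason on stdout. The function names
and the 95 percent threshold are hypothetical.

```shell
#!/bin/sh
# Hypothetical health check. The exit-code convention (0 = healthy) is an
# assumption, not something stated in this README.

LIMIT_PERCENT=95   # hypothetical threshold

# used_percent PATH
# Prints the used percentage (digits only) of the filesystem holding PATH,
# taken from the Capacity column of POSIX-format df output.
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# health_status USED LIMIT
# Returns 0 when USED is at or below LIMIT. Otherwise prints a one-line
# reason and returns 1.
health_status() {
    if [ "$1" -gt "$2" ]; then
        echo "filesystem ${1}% full, limit ${2}%"
        return 1
    fi
    return 0
}

# A standalone script would end with something like:
#   health_status "$(used_percent /)" "$LIMIT_PERCENT"; exit $?
```

Keeping the decision in health_status, separate from the df call, makes the
threshold logic easy to exercise without filling a disk.
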
Configuring Guest Notifications and Voting
------------------------------------------
The Guest-Client running in the VM can be used as a conduit for
notifications of VM lifecycle events being taken by the Titanium Cloud that
will impact this VM. Reboots, pause/resume and migrations are examples of
the types of events your VM can be notified of. Depending on the event, a
vote on the event may be required before a notification is sent.
Notifications may precede the event, follow it or both. The full table of
events and notifications is found below.

Titanium Action  Event Name          Vote*  Pre-notification  Post-notification  Timeout
---------------  ------------------  -----  ----------------  -----------------  --------
stop             stop                yes    yes               no                 shutdown
reboot           reboot              yes    yes               no                 shutdown
pause            pause               yes    yes               no                 suspend
unpause          unpause             no     no                yes                resume
suspend          suspend             yes    yes               no                 suspend
resume           resume              no     no                yes                resume
resize           resize_begin        yes    yes               no                 suspend
                 resize_end          no     no                yes                resume
live-migrate     live_migrate_begin  yes    yes               no                 suspend
                 live_migrate_end    no     no                yes                resume
cold-migrate     cold_migrate_begin  yes    yes               no                 suspend
                 cold_migrate_end    no     no                yes                resume**

*  voting has its own timeout called 'vote' that is event independent.
** after VM reboot and reconnection which is subject to the 'restart' timeout.

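The Timeout column of the table can be read as a lookup from event name to
timeout class. A minimal sketch of that lookup follows. The helper name
timeout_for_event is hypothetical and is not part of the Guest-Client; only
the event names and timeout classes come from the table.

```shell
#!/bin/sh
# timeout_for_event EVENT
# Prints the timeout class (shutdown, suspend or resume) that the table
# lists for EVENT. Returns 1 for an event the table does not list.
timeout_for_event() {
    case "$1" in
        stop|reboot)
            echo shutdown ;;
        pause|suspend|resize_begin|live_migrate_begin|cold_migrate_begin)
            echo suspend ;;
        unpause|resume|resize_end|live_migrate_end|cold_migrate_end)
            echo resume ;;
        *)
            echo "unknown event: $1" >&2
            return 1 ;;
    esac
}
```

Note that cold_migrate_end maps to 'resume' here, but per the second footnote
the VM must first reboot and reconnect, which is governed separately by the
'restart' timeout.
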
Notifications are an opportunity for the VM to take preparatory actions
in anticipation of the forthcoming event, or recovery actions after
the event has completed. A few examples:
  - A reboot or stop notification might allow the application to stop
    accepting transactions and cleanly wrap up existing transactions.
  - A 'resume' notification after a suspend might trigger a time
    adjustment.
  - Pre and post migrate notifications might trigger the application
    to de-register and then re-register with a network load balancer.

If you register a notification handler, it will receive all events. If
an event is not of interest, it should return immediately with a
successful return code.

A process may only register a single notification handler. However,
multiple processes may independently register handlers. A script-based
handler may also be registered via the guest_heartbeat.conf. When
multiple processes and scripts register notification handlers, they
will be run in parallel.

Notifications are subject to configurable timeouts. Timeouts are
specified by each registered process and in the guest_heartbeat.conf.
The timeouts in the guest_heartbeat.conf govern the maximum time all
registered notification handlers have to complete.

While pre-notification handlers are running, the event will be delayed.
If the timeout is reached, the event will be allowed to proceed.

While post-notification handlers are running, or waiting to be run,
the Titanium Cloud will not be able to declare the action complete.
Keep in mind that many events that offer a post-notification will
require the VM's Guest-Client to reconnect to the worker host, and
that may be further delayed while the VM is rebooted, as in a cold
migration. When post-notification is finally triggered, it is subject
to a timeout as well. If the timeout is reached, the event will be
declared complete.

NOTE: A post-event notification that follows a reboot, as in the
cold_migrate_end event, is a special case. It will be triggered as
soon as the local heartbeat server reconnects with the worker host,
and likely before any processes have a chance to register a handler.
The only handler guaranteed to see such a notification is a script
directly registered by the Guest-Client itself via guest_heartbeat.conf.


In addition to notifications, there is also an opportunity for the VM
to vote on any proposed event. Voting precedes all notifications,
and offers the VM a chance to reject the event the Titanium Cloud wishes
to initiate. If multiple handlers are registered, it only takes one
rejection to abort the event.

The same handler that handles notifications also handles voting.

Voting is subject to a configurable timeout. The same timeout applies
regardless of the event. The timeout is taken from the
guest_heartbeat.conf file and is specified when the Guest-Client
registers with compute services on the host. This timeout governs the
maximum time all registered voting handlers have to complete the vote.

Any voters that fail to vote within the timeout are assumed to have agreed
with the proposed action.

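The voting rules described so far (one rejection aborts the event, and a voter
that times out is counted as agreeing) can be sketched as follows. The
function name tally_votes and the vote labels accept, reject and timeout are
illustrative assumptions, not Guest-Client identifiers.

```shell
#!/bin/sh
# tally_votes VOTE...
# Each VOTE is one of: accept, reject, timeout.
# Prints "rejected" and returns 1 if any voter rejected. Otherwise prints
# "accepted" and returns 0. A timeout is treated the same as an accept.
tally_votes() {
    for vote in "$@"; do
        if [ "$vote" = "reject" ]; then
            echo rejected
            return 1
        fi
    done
    echo accepted
    return 0
}
```

With no registered voters the loop body never runs, so the event is accepted.
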
Rejecting an event should be the exception, not the rule, reserved for
cases when the VM is handling an exceptionally sensitive operation,
or a slow operation that can't complete within the notification timeout.
An example:
  - an active-standby application deployment (1:1), where the active
    rejects a shutdown or pause or ... because its peer standby is not
    ready or synchronized.

A vote handler should generally not take any action beyond returning its
vote. Just because you vote to accept doesn't mean all your peers
will also accept (i.e. the event might not happen). Taking an action
against an event that never happens is almost certainly NOT what you want.
Instead, save your actions for the notification that follows if no one
rejects. The one exception might be to temporarily block the initiation of
any new task that would cause you to vote to reject an event in the near
future, the theory being that the requester of the event may retry in
the near future.

The Titanium Cloud is not required to offer a vote. Voting may be
bypassed in certain recovery scenarios.

Configuring Guest-Client Notification and Voting ...

    ## The overall time to vote in seconds regardless of the event being voted
    ## upon. It should reflect the slowest of all expected voters when in a sane
    ## and healthy condition, plus some allowance for scheduling and messaging.
    VOTE=8

    ## The overall time to handle a stop or reboot notification in seconds.
    ## It should reflect the slowest of all expected notification handlers
    ## when in a sane and healthy condition, plus some allowance for scheduling
    ## and messaging.
    SHUTDOWN_NOTICE=8

    ## The overall time to handle a pause, suspend or migrate-begin notification
    ## in seconds. It should reflect the slowest of all expected notification
    ## handlers when in a sane and healthy condition, plus some allowance for
    ## scheduling and messaging.
    SUSPEND_NOTICE=8

    ## The overall time to handle an unpause, resume or migrate-end notification
    ## in seconds. It should reflect the slowest of all expected notification
    ## handlers when in a sane and healthy condition, plus some allowance for
    ## scheduling and messaging. It does not include reboot time.
    RESUME_NOTICE=13

    ## The overall time to reboot, up to the point the guest-client heartbeat
    ## starts in seconds. Allow for some I/O contention.
    RESTART=300

    ## The Path to the event notification script. This is optional.
    ## The script will be called when an action is initiated that will impact
    ## the VM.
    ##
    ## The event handling script is invoked with two parameters:
    ##
    ##    event_handling_script <MSG_TYPE> <EVENT>
    ##
    ## MSG_TYPE is one of:
    ##    'revocable'    Indicating a vote is called for. Return zero to accept,
    ##                   non-zero to reject. For a rejection, the first line of
    ##                   stdout emitted by the script will be captured and
    ##                   logged indicating why the event was rejected.
    ##
    ##    'irrevocable'  Indicating this is a notification only. Take preparatory
    ##                   actions and return zero if successful, or non-zero on
    ##                   failure. For a failure, the first line of stdout
    ##                   emitted by the script will be captured and logged
    ##                   indicating the cause of the failure.
    ##
    ## EVENT is one of: ( 'stop', 'reboot', 'pause', 'unpause', 'suspend',
    ##                    'resume', 'live_migrate_begin',
    ##                    'live_migrate_end', 'cold_migrate_begin',
    ##                    'cold_migrate_end' )
    ##
    EVENT_NOTIFICATION_SCRIPT="/etc/guest-client/heartbeat/sample_event_handling_script"


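A script-based handler following the calling convention in the comments above
might look like the sketch below. Only the two arguments and the return-code
and stdout conventions come from the configuration comments. The function name
handle_event and the BUSY_FLAG marker file are hypothetical, and the
irrevocable branch is left as a placeholder.

```shell
#!/bin/sh
# Hypothetical marker file: the application creates it while it cannot
# tolerate an interruption.
BUSY_FLAG="${BUSY_FLAG:-/tmp/app_busy}"

# handle_event MSG_TYPE EVENT
# Returns 0 to accept a vote or report a handled notification, non-zero to
# reject or report failure. On rejection or failure the first line of
# stdout gives the reason.
handle_event() {
    msg_type="$1"
    event="$2"

    case "$msg_type" in
        revocable)
            # A vote: decide only, take no other action.
            case "$event" in
                stop|reboot|pause|suspend)
                    if [ -e "$BUSY_FLAG" ]; then
                        echo "application busy, rejecting $event"
                        return 1
                    fi
                    ;;
            esac
            return 0
            ;;
        irrevocable)
            # A notification: prepare for, or recover from, the event.
            case "$event" in
                stop|reboot)
                    : # placeholder: stop accepting new transactions here
                    ;;
                *)
                    : # events of no interest succeed immediately
                    ;;
            esac
            return 0
            ;;
        *)
            echo "unexpected message type: $msg_type"
            return 1
            ;;
    esac
}

# A standalone script would end with:
#   handle_event "$1" "$2"; exit $?
```

The vote branch deliberately does nothing except decide, in line with the
advice above that actions belong in the notification that follows.
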
VM Application Setup
====================
An application running in the VM may wish to register directly for voting
and notifications. See the guest_heartbeat_api.h for more details. A
working example can be found in the guest_client_api source directory in the
sample_guest_app.c file.

To compile the sample-guest-app run ...
    cd wrs-guest-heartbeat-3.0.0
    make sample

This will create an executable called sample-guest-app in the 'build'
directory.

When compiling the guest application ...

  include headers:
    #include <guest_api_types.h>
    #include <guest_api_debug.h>
    #include <guest_heartbeat_api.h>

  link with:
    -lguest_common_api -lguest_heartbeat_api