UsageHandler and NotabeneHandler docs.

Includes a description of failure conditions.

Change-Id: I312c1d4685ba1138d1478ffc1d424fc2216a450c
Sandy Walsh 2015-03-08 20:42:00 -07:00
parent 1d5f7d21fd
commit a5aad8bfce
2 changed files with 187 additions and 6 deletions

View File

@@ -56,7 +56,20 @@
<div class="row marketing">
<div class="col-lg-12">
<h3>The Parts</h3>
<h3>Topics</h3>
<ul>
<li><a href='#overview'>Overview</a></li>
<li><a href='#enabling'>Enabling notifications in OpenStack</a></li>
<li><a href='#consuming'>Consuming notifications with Yagi</a></li>
<li><a href='#distill'>Distilling notifications into events</a></li>
<li><a href='#streams'>Streams</a></li>
<li><a href='#config'>Winchester config files</a></li>
<li><a href='#pipeline'>Winchester pipeline handlers</a></li>
<ul>
<li><a href='#usage'>Usage Handler</a></li>
<li><a href='#notabene'>Notabene Handler</a></li>
</ul>
</ul>
<h3><a id='overview'>The Parts</a></h3>
<p>Let's spend a few seconds to talk about the StackTach.v3 components and establish some terminology.</p>
<img src='v3_arch.gif' class="img-rounded"/>
@@ -78,7 +91,7 @@
<div class="panel panel-info">
<div class="panel-heading">
<h3 class="panel-title">Enabling notifications in OpenStack</h3>
<h3 class="panel-title"><a id='enabling'>Enabling notifications in OpenStack</a></h3>
</div>
<div class="panel-body">
<p>In order to get notifications from OpenStack, you'll need to put the following lines in your service configuration files.</p>
@@ -86,13 +99,17 @@
<pre>
--notification_driver=nova.openstack.common.notifier.rpc_notifier
--notification_topics=monitor
--notify_on_state_change=vm_and_task_state
--notify_on_any_change=True
--instance_usage_audit=True
--instance_usage_audit_period=hour
</pre>
<p>Where "monitor" is the name of the queue you wish to send the notifications to. When you are configuring Yagi, you'll need to ensure the queue name prefix matches what you've defined here.</p>
</div>
</div>
<h3>Consuming Notifications with Yagi</h3>
<h3><a id='consuming'>Consuming Notifications with Yagi</a></h3>
<p>You're going to need a way to get those notifications out of your queuing system. That's what Yagi does. Yagi reads notifications from one place and spits them out somewhere else. It's highly configurable and battle-tested for large-scale deployments.</p>
<p>You launch the <code>yagi-events</code> process with the following command:</p>
@@ -133,7 +150,7 @@ max_messages = 100
<p>The important part of this configuration is the <code>[event_worker]</code> section. This says we want to use the RabbitMQ data source. The RabbitMQ connectivity information is stored in the <code>[rabbit_broker]</code> section. The name of each RabbitMQ queue to consume from is specified in the <code>[consumers]</code> section. For every queue you define there, you will need a <code>[consumer:&lt;queue_name&gt;]</code> section. This last section is where the real magic happens. Beyond defining the exchange, routing_key and durability characteristics, it defines the chain of <code>Yagi Handlers</code> that will run on every notification that gets consumed.</p>
<p>You can write your own Yagi handlers if you like, but there are a number that ship with StackTach.v3 to do some interesting things. The most important of these is the <a href='https://github.com/stackforge/stacktach-winchester/blob/4875e419a66974e416dbe3b43ed286017bad1ec4/winchester/yagi_handler.py#L18'>winchester.yagi_handler:WinchesterHandler</a>. This handler is your entry point into StackTach.v3 stream processing. But first, we need to convert those messy notifications into events ...</p>
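<p>As a rough sketch (assuming Yagi's INI-style config; the key names beyond the section names described above are illustrative and may differ in your Yagi version), a consumer section that wires WinchesterHandler into the chain might look like this:</p>
<pre>
[consumers]
queues = monitor.info

[consumer:monitor.info]
apps = winchester.yagi_handler.WinchesterHandler
exchange = nova
exchange_type = topic
routing_key = monitor.info
durable = True
</pre>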
<h3>Distilling Notifications to Events</h3>
<h3><a id='distill'>Distilling Notifications to Events</a></h3>
<p>Now we have notifications coming into Winchester. But, as we hinted at above, we need to take the larger notification and <i>distill</i> it down into a more manageable event. The stack-distiller module makes this happen. Within StackTach.v3, this is part of <code>winchester.yagi_handler:WinchesterHandler</code>.</p>
<p>A notification is a large, nested JSON data structure. But we don't need all of that data for stream processing. In fact, we generally only require a few <a href='glossary.html#trait'>Traits</a> from the notification. That's what distilling does. It pulls out the important traits, scrubs the data and discards the rest. Distillations are defined in the distillation configuration file (specified in winchester.conf).</p>
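<p>A distillation definition maps event types to the traits to extract. A minimal sketch (the trait names and payload fields are illustrative; see the distiller config that ships with StackTach.v3 for the real definitions):</p>
<pre>
- event_type: compute.instance.*
  traits:
    instance_id:
      fields: payload.instance_id
    tenant_id:
      fields: payload.tenant_id
    launched_at:
      type: datetime
      fields: payload.launched_at
</pre>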
@@ -259,7 +276,7 @@ max_messages = 100
</table>
<h3>Streams</h3>
<h3><a id='streams'>Streams</a></h3>
<div class="alert alert-success" role="alert">Congrats! Now that you have events, you can start to make Streams!</div>
<p>Streams are the key to StackTach.v3. You should have a good understanding of the lifecycle of a stream and how to define one. So let's start with some basics ... </p>
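<p>As a preview, a stream is described by a trigger definition in YAML. A hedged sketch (the name, expiration, pipelines and criteria here are illustrative, not a shipped definition):</p>
<pre>
- name: test_trigger
  distinguished_by:
    - instance_id
  expiration: "$last + 1h"
  fire_pipeline: "test_pipeline"
  expire_pipeline: "test_expire_pipeline"
  match_criteria:
    - event_type:
        - compute.instance.*
        - "!compute.instance.exists"
  fire_criteria:
    - event_type: compute.instance.create.end
</pre>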
@@ -285,7 +302,7 @@ pipeline_handlers:
<div class="panel panel-info">
<div class="panel-heading">
<h3 class="panel-title">Telling Yagi WinchesterHandler where to find the Winchester config file.</h3>
<h3 class="panel-title"><a id='config'>Telling Yagi WinchesterHandler where to find the Winchester config file.</a></h3>
</div>
<div class="panel-body">
<p>We left that little detail out when we were explaining Yagi previously. But the WinchesterHandler needs to know where your winchester config file lives. You define this by adding a <code>[winchester]</code> section to your yagi config file.</p>
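<p>For example (assuming the <code>config_file</code> key; the path is illustrative):</p>
<pre>
[winchester]
config_file = /etc/winchester/winchester.yaml
</pre>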
@@ -421,6 +438,167 @@ winchester.debugging[INFO line: 161] ----------------------------
<p>The winchester.debugging lines will tell you how firing and matching criteria are progressing. In this case, it's saying that 397 firing criteria checks were made and only 2 passed. If your debug level is 2, you will get a breakdown of the reasons the checks failed. You can use this information to review your trigger definitions and see if something could be wrong. Additionally, the matching criteria results are detailed. In this case we see that, of 207 events, 200 were acceptable. The details on the 7 rejected are listed below. Finally, some "counters" are supplied on the stream processing in general: 58 new streams were created on this pass, 100 new events were added to various "test_trigger" streams, and 1 stream is ready to fire.</p>
<p>By selectively turning on per-stream debugging, you can quickly find processing problems and ignore a lot of log noise.</p>
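<p>Per-stream debugging is switched on in the trigger definition itself. A sketch, assuming the <code>debug_level</code> key (level 2 gives the per-check breakdown described above):</p>
<pre>
- name: test_trigger
  debug_level: 2
  # ... rest of the trigger definition ...
</pre>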
<h3><a id='pipeline'>Winchester Pipeline Handlers</a></h3>
<p>Winchester comes with a set of stock pipeline handlers for the
most popular OpenStack operations.</p>
<h4><a id='usage'>The UsageHandler</a></h4>
<p>The UsageHandler is a pipeline handler for determining the daily usage of every instance within an OpenStack Nova deployment. The usage handler is cells-aware, so it can support large deployments.</p>
<p>The usage handler requires a stream per instance per day. It triggers when the <code>compute.instance.exists</code> event is seen. Audit notifications should be <a href='#enabling'>enabled</a> within Nova. See the samples for an example of a usage stream definition; a sketch follows below.</p>
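<p>A hedged sketch of such a usage stream definition (the name, expiration and pipeline name are illustrative; the distinguishing traits and the .exists mapping mirror the shipped samples):</p>
<pre>
- name: usage
  distinguished_by:
    - instance_id
    - timestamp: "day"
  expiration: "$last + 1h"
  fire_pipeline: "usage_pipeline"
  fire_criteria:
    - event_type: compute.instance.exists
  match_criteria:
    - event_type:
        - compute.instance.*
        - "!compute.instance.exists"
        - "!compute.instance.exists.failed"
        - "!compute.instance.exists.warnings"
        - "!compute.instance.exists.verified"
    - event_type: compute.instance.exists
      map_distinguished_by:
        timestamp: audit_period_beginning
</pre>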
<p>Once triggered, the usage handler will compare the daily transactional events for every instance against the various .exists records for that instance. If nothing happens to an instance within that 24-hour period, an end-of-day .exists notification is sent from Nova. Nova operations that change the <code>launched_at</code> date for an instance will issue additional .exists records. These include create, delete, resize and rebuild operations. If the transactional events for the instance match the values in the .exists event, a <code>compute.instance.exists.verified</code> notification is created; otherwise, <code>compute.instance.exists.failed</code> and/or <code>compute.instance.exists.warnings</code> notifications are created. When coupled with the NotabeneHandler, these new notifications can be republished to the queue for subsequent processing.</p>
<p>The schemas of these new notifications are as follows:</p>
<span class="label label-default">compute.instance.exists.verified</span>
<pre>
{
    'event_type': human readable name of event (e.g. foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'stream_id': stream id
    'original_message_id': message_id of .exists event
    'payload': {
        'audit_period_beginning': start datetime of audit period
        'audit_period_ending': ending datetime of audit period
        'launched_at': datetime this instance was launched
        'deleted_at': datetime this instance was deleted
        'instance_id': instance uuid
        'tenant_id': tenant id
        'display_name': instance display name
        'instance_type': instance flavor type description
        'instance_flavor_id': instance flavor type id
        'state': instance vm power state
        'state_description': human readable instance vm power state
        'bandwidth': {
            'public': {
                'bw_in': incoming bandwidth
                'bw_out': outgoing bandwidth
            }
        },
        'image_meta': {
            'org.openstack__1__architecture': image architecture
            'org.openstack__1__os_version': image version
            'org.openstack__1__os_distro': image distribution
            'org.rackspace__1__options': service provider specific (opt)
        }
    },
}
</pre>
<span class="label label-default">compute.instance.exists.failed</span>
<pre>
{
    'event_type': human readable name of event (e.g. foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'stream_id': stream id
    'original_message_id': message_id of .exists event
    'error': human readable explanation for verification failure
    'error_code': numeric error code (see below)
    'payload': {
        'audit_period_beginning': start datetime of audit period
        'audit_period_ending': ending datetime of audit period
        'launched_at': datetime this instance was launched
        'deleted_at': datetime this instance was deleted
        'instance_id': instance uuid
        'tenant_id': tenant id
        'display_name': instance display name
        'instance_type': instance flavor type description
        'instance_flavor_id': instance flavor type id
        'state': instance vm power state
        'state_description': human readable instance vm power state
        'bandwidth': {
            'public': {
                'bw_in': incoming bandwidth
                'bw_out': outgoing bandwidth
            }
        },
        'image_meta': {
            'org.openstack__1__architecture': image architecture
            'org.openstack__1__os_version': image version
            'org.openstack__1__os_distro': image distribution
            'org.rackspace__1__options': service provider specific (opt)
        }
    },
}
</pre>
<p>Tests currently performed by the UsageHandler include:</p>
<table class="table table-bordered">
<thead><tr><th>Error Code</th><th>Message</th><th>Explanation</th></tr></thead>
<tr><td>U1</td>
<td>.exists has no launched_at value.</td>
<td>We received a .exists event that has no launched_at value set.</td></tr>
<tr><td>U2</td>
<td>Conflicting '[trait]' values ('value1' != 'value2')</td>
<td>A trait in the .exists record does not match the value of the related transactional event.</td></tr>
<tr><td>U3</td>
<td>.exists state not 'deleted' but .exists deleted_at is set.</td>
<td>The .exists event says the instance is not deleted, but its deleted_at trait is set.</td></tr>
<tr><td>U4</td>
<td>.exists deleted_at less than .exists launched_at.</td>
<td>The deleted_at trait is earlier than when the instance was launched.</td></tr>
<tr><td>U5</td>
<td>.exists deleted_at in audit period, but no matching .deleted event found.</td>
<td>The deleted_at trait falls within the last 24hrs, but we didn't receive any .deleted events in that time frame.</td></tr>
<tr><td>U6</td>
<td>.deleted events found but .exists has no deleted_at value.</td>
<td>We received transactional .deleted events, but the deleted_at trait in the .exists event is not defined.</td></tr>
<tr><td>U7</td>
<td>Multiple .delete.end events</td>
<td>We should only get one compute.instance.delete.end event.</td></tr>
<tr><td>U8</td>
<td>.exists launched_at in audit period, but no related events found.</td>
<td>We received a .exists event that has the launched_at trait within the last 24hrs, but there were no transactional events in that time frame.</td></tr>
</table>
<span class="label label-default">compute.instance.exists.warnings</span>
<pre>
{
    'event_type': human readable name of event (e.g. foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'instance_id': instance uuid
    'stream_id': stream id
    'warnings': [list of human readable warning messages]
}
</pre>
<h4><a id='notabene'>The NotabeneHandler</a></h4>
<p>The NotabeneHandler will take any new notifications (not events) it finds in the pipeline environment and publish them to the specified RabbitMQ exchange. The handler looks for a key/value pair in the pipeline environment (passed into the handler on the handle_events() call).</p>
<p>In your pipeline definition, you can set the configuration for the NotabeneHandler as shown below. Note how the environment variable keys are defined by the <code>env_keys</code> value. This can be a list of keys. Any new notifications this handler finds in those variables will get published to the RabbitMQ exchange specified in the rest of the configuration. The <code>queue_name</code> is also critical so we know which topic to publish to. In OpenStack, the routing key is the queue name. The notabene handler does connection pooling to the various queues, so specifying many different servers is not expensive.</p>
<p>Because these environment keys have to be set before the notabene handler is called, it has to be one of the last handlers in the pipeline. The UsageHandler adds new notifications to the <code>usage_notifications</code> key. If the notabene handler is not part of the pipeline, these new notifications are dropped when the pipeline is finished.</p>
<pre>
test_expire_pipeline:
  - logger
  - usage
  - name: notabene
    params:
      host: localhost
      user: guest
      password: guest
      port: 5672
      vhost: /
      library: librabbitmq
      exchange: nova
      exchange_type: topic
      queue_name: monitor.info
      env_keys:
        - usage_notifications
</pre>
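<p>Under the hood, the env_keys mechanism is simple: any pipeline handler can append new notifications to a list stored under a key in the pipeline environment, and the NotabeneHandler publishes whatever it finds under the keys it was configured with. A minimal sketch of a custom handler (assuming Winchester's PipelineHandlerBase interface of handle_events/commit/rollback; the class name and notification fields are illustrative):</p>
<pre>
from winchester.pipeline_handler import PipelineHandlerBase


class MyNotifyingHandler(PipelineHandlerBase):
    def handle_events(self, events, env):
        # Build a new notification from the stream's events ...
        notification = {'event_type': 'my.custom.event',
                        'payload': {'event_count': len(events)}}
        # ... and queue it under a key named in the NotabeneHandler's
        # env_keys list (e.g. usage_notifications).
        env.setdefault('usage_notifications', []).append(notification)
        return events

    def commit(self):
        # Nothing to persist in this sketch.
        pass

    def rollback(self):
        # Nothing to undo in this sketch.
        pass
</pre>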
<footer class="footer">
<p>&copy; Dark Secret Software Inc. 2014</p>
</footer>

View File

@@ -15,6 +15,9 @@
- rebuild_instance
- compute.instance.*
- "!compute.instance.exists"
- "!compute.instance.exists.failed"
- "!compute.instance.exists.warnings"
- "!compute.instance.exists.verified"
- event_type: compute.instance.exists
  map_distinguished_by:
    timestamp: audit_period_beginning