diff --git a/docs/about.html b/docs/about.html
index 51afdb7..08c736e 100644
--- a/docs/about.html
+++ b/docs/about.html
@@ -56,7 +56,20 @@
Topics

The Parts

Let's spend a few seconds talking about the StackTach.v3 components and establishing some terminology.

@@ -78,7 +91,7 @@

Enabling notifications in OpenStack

In order to get notifications from OpenStack, you'll need to put the following lines in your service configuration files.

@@ -86,13 +99,17 @@
 --notification_driver=nova.openstack.common.notifier.rpc_notifier
 --notification_topics=monitor
+--notify_on_state_change=vm_and_task_state
+--notify_on_any_change=True
+--instance_usage_audit=True
+--instance_usage_audit_period=hour
 

Where "monitor" is the name of the queue you wish to send the notifications to. When you are configuring Yagi, you'll need to ensure the queue name prefix matches what you've defined here.

Consuming Notifications with Yagi

You're going to need a way to get those notifications out of your queuing system. That's what Yagi does. Yagi reads notifications from one place and spits them out somewhere else. It's highly configurable and battle-tested for large-scale deployments.

You launch the yagi-events process with the following command:

@@ -133,7 +150,7 @@ max_messages = 100

The important part of this configuration is the [event_worker] section. This says we want to use the RabbitMQ data source. The RabbitMQ connectivity information is stored in the [rabbit_broker] section. The name of each RabbitMQ queue to consume from is specified in the [consumers] section. For every queue you define there, you will need a [consumer:&lt;queue_name&gt;] section. This last section is where the real magic happens. Beyond defining the exchange, routing_key and durability characteristics, it defines the chain of Yagi handlers that will run on every notification that gets consumed.
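Since the full yagi.conf is elided from this excerpt, here is a rough sketch of the shape just described. The section names come from the prose above; the specific driver path, handler path and values are illustrative assumptions, not authoritative:

[event_worker]
# the data source: the RabbitMQ broker driver (path assumed)
event_driver = yagi.broker.rabbit.Broker

[rabbit_broker]
host = localhost
user = guest
password = guest
port = 5672
vhost = /

[consumers]
# one entry per queue to consume from
queues = monitor.info

[consumer:monitor.info]
# the chain of Yagi handlers run on every notification (path assumed)
apps = winchester.yagi_handler.WinchesterHandler
exchange = nova
exchange_type = topic
routing_key = monitor.info
durable = True
max_messages = 100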

You can write your own Yagi handlers if you like, but there are a number that ship with StackTach.v3 to do some interesting things. The most important of these is the winchester.yagi_handler:WinchesterHandler. This handler is your entry point into StackTach.v3 stream processing. But first, we need to convert those messy notifications into events ...

Distilling Notifications to Events

Now we have notifications coming into Winchester. But, as we hinted at above, we need to take the larger notification and distill it down into a more manageable event. The stack-distiller module makes this happen. Within StackTach.v3, this is part of winchester.yagi_handler:WinchesterHandler.

A notification is a large, nested JSON data structure. But we don't need all of that data for stream processing. In fact, we generally only require a few Traits from the notification. That's what distilling does. It pulls out the important traits, scrubs the data, and uses that. Distillations are defined in the distillation configuration file (specified in winchester.conf).
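As a rough sketch, a distillation entry maps notification fields to traits. This assumes the Ceilometer-style event definition format that stack-distiller derives from; the trait names here are illustrative:

- event_type: compute.instance.*
  traits:
    instance_id:
      fields: payload.instance_id
    tenant_id:
      fields: payload.tenant_id
    launched_at:
      type: datetime
      fields: payload.launched_at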

@@ -259,7 +276,7 @@ max_messages = 100

Streams

Streams are the key to StackTach.v3. You should have a good understanding of the lifecycle of a stream and how to define one. So let's start with some basics ...

@@ -285,7 +302,7 @@ pipeline_handlers:

Telling Yagi WinchesterHandler where to find the Winchester config file.

We left that little detail out when we were explaining Yagi previously. But the WinchesterHandler needs to know where your winchester config file lives. You define this by adding a [winchester] section to your yagi config file.
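A minimal sketch of that section follows; the config_file key name is an assumption here, so check the sample yagi.conf that ships with StackTach.v3:

[winchester]
config_file = /etc/winchester/winchester.yaml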

@@ -421,6 +438,167 @@
 winchester.debugging[INFO line: 161] ----------------------------
The winchester.debugging lines will tell you how the firing and matching criteria are progressing. In this case, it's saying that 397 firing criteria checks were made and only 2 passed. If your debug level is 2, you will get a breakdown of the reasons the checks failed. You can use this information to review your trigger definitions and see if something could be wrong. Additionally, the matching criteria results are detailed. In this case we see that, of 207 events, 200 were acceptable. The details on the 7 rejected are listed below. Finally, some "counters" on stream processing in general are supplied: 58 new streams were created on this pass, 100 new events were added to various "test_trigger" streams, and 1 stream is ready to fire.

By selectively turning on per-stream debugging, you can quickly find processing problems and ignore a lot of log noise.
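Per-stream debugging is switched on in the trigger definition itself. A minimal sketch, assuming the debug_level key used by the sample trigger definitions:

- name: test_trigger
  debug_level: 2   # level 2 adds the per-check failure reasons described above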

Winchester Pipeline Handlers

Winchester comes with a set of stock pipeline handlers for the most popular OpenStack operations.

The UsageHandler

The UsageHandler is a pipeline handler for determining the daily usage of every instance within an OpenStack Nova deployment. The usage handler is cells-aware, so it can support large deployments.

The usage handler requires a stream per instance per day. It triggers when the compute.instance.exists event is seen. Audit notifications should be enabled within Nova (the instance_usage_audit flags shown earlier). See the samples for an example of a usage stream definition, sketched below.
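A sketch of such a stream (trigger) definition, along the lines of the samples shipped with Winchester; the name, expiration and pipeline values here are illustrative:

- name: usage
  distinguished_by:
    - instance_id
    - timestamp: "day"
  expiration: "$last + 2d"
  fire_pipeline: "usage_pipeline"
  fire_criteria:
    - event_type: compute.instance.exists
  match_criteria:
    - event_type:
        - compute.instance.*
        - "!compute.instance.exists"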

Once triggered, the usage handler compares the daily transactional events for every instance against the various .exists records for that instance. If nothing happens to an instance within that 24-hour period, an end-of-day .exists notification is sent from Nova. Nova operations that change the launched_at date for an instance will issue additional .exists records. These include create, delete, resize and rebuild operations. If the transactional events for the instance match the values in the .exists event, a compute.instance.exists.verified notification is created; otherwise, compute.instance.exists.failed and/or compute.instance.exists.warnings notifications are created. When coupled with the NotabeneHandler, these new notifications can be republished to the queue for subsequent processing.

The schemas of these new notifications are as follows:

compute.instance.exists.verified

{
    'event_type': human readable name of event (eg: foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'stream_id': stream id
    'original_message_id': message_id of .exists event
    'payload': {
      'audit_period_beginning': start datetime of audit period
      'audit_period_ending': ending datetime of audit period
      'launched_at': datetime this instance was launched
      'deleted_at': datetime this instance was deleted
      'instance_id': instance uuid
      'tenant_id': tenant id
      'display_name': instance display name
      'instance_type': instance flavor type description
      'instance_flavor_id': instance flavor type id
      'state': instance vm power state
      'state_description': human readable instance vm power state
      'bandwidth': {
         'public': {
           'bw_in': incoming bandwidth
           'bw_out': outgoing bandwidth
         }
      },
      'image_meta': {
        'org.openstack__1__architecture': image architecture
        'org.openstack__1__os_version': image version
        'org.openstack__1__os_distro': image distribution
        'org.rackspace__1__options': service provider specific (opt)
      }
    },
}
compute.instance.exists.failed

{
    'event_type': human readable name of event (eg: foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'stream_id': stream id
    'original_message_id': message_id of .exists event
    'error': human readable explanation for verification failure
    'error_code': numeric error code (see below)
    'payload': {
      'audit_period_beginning': start datetime of audit period
      'audit_period_ending': ending datetime of audit period
      'launched_at': datetime this instance was launched
      'deleted_at': datetime this instance was deleted
      'instance_id': instance uuid
      'tenant_id': tenant id
      'display_name': instance display name
      'instance_type': instance flavor type description
      'instance_flavor_id': instance flavor type id
      'state': instance vm power state
      'state_description': human readable instance vm power state
      'bandwidth': {
         'public': {
           'bw_in': incoming bandwidth
           'bw_out': outgoing bandwidth
         }
      },
      'image_meta': {
        'org.openstack__1__architecture': image architecture
        'org.openstack__1__os_version': image version
        'org.openstack__1__os_distro': image distribution
        'org.rackspace__1__options': service provider specific (opt)
      }
    },
}

Tests currently performed by the UsageHandler include:

Error Code | Message                                                                    | Explanation
---------- | -------------------------------------------------------------------------- | -----------
U1         | .exists has no launched_at value.                                          | We received a .exists event that has no launched_at value set.
U2         | Conflicting '[trait]' values ('value1' != 'value2')                        | A trait in the .exists record does not match the value of the related transactional event.
U3         | .exists state not 'deleted' but .exists deleted_at is set.                 | The .exists deleted_at trait is set, but Nova says the instance is not deleted.
U4         | .exists deleted_at less than .exists launched_at.                          | The deleted_at trait is earlier than when the instance was launched.
U5         | .exists deleted_at in audit period, but no matching .deleted event found.  | The deleted_at trait falls within the last 24hrs, but we didn't receive any .deleted events in that time frame.
U6         | .deleted events found but .exists has no deleted_at value.                 | We received transactional .deleted events, but the deleted_at trait in the .exists event is not defined.
U7         | Multiple .delete.end events                                                | We should only get one compute.instance.delete.end event.
U8         | .exists launched_at in audit period, but no related events found.          | We received a .exists event with a launched_at trait within the last 24hrs, but there were no transactional events in that time frame.

compute.instance.exists.warnings

{
    'event_type': human readable name of event (eg: foo.blah.zoo)
    'message_id': unique message id (uuid)
    'timestamp': datetime this notification was generated at source
    'instance_id': instance uuid
    'stream_id': stream id
    'warnings': [list of human readable warning messages]
}

The NotabeneHandler

The NotabeneHandler will take any new notifications (not events) it finds in the pipeline environment variables and publish them to the specified RabbitMQ exchange. The handler looks for a key/value in the pipeline environment (passed into the handler on the handle_events() call).

In your pipeline definition, you can set the configuration for the NotabeneHandler as shown below. Note how the environment variable keys are defined by the env_keys value. This can be a list of keys. Any new notifications this handler finds in those variables will get published to the RabbitMQ exchange specified in the rest of the configuration. The queue_name is also critical, so we know which topic to publish to. In OpenStack, the routing key is the queue name. The notabene handler does connection pooling to the various queues, so specifying many different servers is not expensive.

Because these environment keys have to be set before the notabene handler is called, it has to be one of the last handlers in the pipeline. The UsageHandler adds new notifications to the usage_notifications key. If the notabene handler is not part of the pipeline, these new notifications are dropped when the pipeline is finished.

test_expire_pipeline:
    - logger
    - usage
    - name: notabene
      params:
        host: localhost
        user: guest
        password: guest
        port: 5672
        vhost: /
        library: librabbitmq
        exchange: nova
        exchange_type: topic
        queue_name: monitor.info
        env_keys:
            - usage_notifications

diff --git a/winchester/triggers.yaml b/winchester/triggers.yaml
index fb99be6..198166f 100644
--- a/winchester/triggers.yaml
+++ b/winchester/triggers.yaml
@@ -15,6 +15,9 @@
         - rebuild_instance
         - compute.instance.*
         - "!compute.instance.exists"
+        - "!compute.instance.exists.failed"
+        - "!compute.instance.exists.warnings"
+        - "!compute.instance.exists.verified"
     - event_type: compute.instance.exists
       map_distinguished_by:
         timestamp: audit_period_beginning