Merge "copy admin-guide"

2017-07-13 17:53:26 +00:00
commit 1b78978cba
parent 990414712f c9f2c43dae
3 changed files with 351 additions and 0 deletions
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@ -0,0 +1,7 @@
+==========================
+Telemetry Alarming service
+==========================
+
+.. toctree::
+
+   telemetry-alarms.rst
--- a/doc/source/admin/telemetry-alarms.rst
+++ b/doc/source/admin/telemetry-alarms.rst
@ -0,0 +1,343 @@
+.. _telemetry-alarms:
+
+======
+Alarms
+======
+
+Alarms provide user-oriented Monitoring-as-a-Service for resources
+running on OpenStack. This type of monitoring lets you automatically
+scale a group of instances in or out through the Orchestration
+service, and you can also use alarms for general-purpose awareness of
+the health of your cloud resources.
+
+These alarms follow a tri-state model:
+
+ok
+  The rule governing the alarm has been evaluated as ``False``.
+
+alarm
+  The rule governing the alarm has been evaluated as ``True``.
+
+insufficient data
+  There are not enough datapoints available in the evaluation periods
+  to meaningfully determine the alarm state.
+
+Alarm definitions
+~~~~~~~~~~~~~~~~~
+
+The definition of an alarm provides the rules that govern when a state
+transition should occur, and the actions to be taken thereon. The
+nature of these rules depends on the alarm type.
+
+Threshold rule alarms
+---------------------
+
+For conventional threshold-oriented alarms, state transitions are
+governed by:
+
+* A static threshold value with a comparison operator such as greater
+  than or less than.
+
+* A statistic selection to aggregate the data.
+
+* A sliding time window to indicate how far back into the recent past
+  you want to look.
+
+Valid threshold alarms are: ``gnocchi_resources_threshold_rule``,
+``gnocchi_aggregation_by_metrics_threshold_rule``, or
+``gnocchi_aggregation_by_resources_threshold_rule``.
+
+.. note::
+
+  As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's
+  native storage API is deprecated.
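+
+The three ingredients above can be sketched in a few lines of Python.
+This is a simplified illustration, not the Aodh evaluator: ``evaluate``
+and its lookup tables are hypothetical names, and the sketch assumes
+that any period without datapoints yields ``insufficient data`` and
+that anything short of every period breaching the threshold counts as
+``ok``.
+
+.. code-block:: python
+
+   import operator
+   from statistics import mean
+
+   # Hypothetical lookup tables for illustration; not part of Aodh.
+   COMPARATORS = {"gt": operator.gt, "ge": operator.ge,
+                  "lt": operator.lt, "le": operator.le,
+                  "eq": operator.eq, "ne": operator.ne}
+   STATISTICS = {"mean": mean, "max": max, "min": min,
+                 "last": lambda points: points[-1]}
+
+
+   def evaluate(periods, threshold, comparison="gt", statistic="mean"):
+       """Return 'ok', 'alarm' or 'insufficient data' for one evaluation.
+
+       ``periods`` holds one list of datapoints per evaluation period,
+       oldest first.
+       """
+       # Assumption: a period with no datapoints makes the result unknown.
+       if not periods or any(len(points) == 0 for points in periods):
+           return "insufficient data"
+       compare = COMPARATORS[comparison]
+       aggregate = STATISTICS[statistic]
+       # The rule is True only if every period breaches the threshold.
+       breached = all(compare(aggregate(points), threshold)
+                      for points in periods)
+       return "alarm" if breached else "ok"
+
+With a threshold of ``70.0``, ``evaluate([[80, 90], [75], [71, 72]], 70.0)``
+returns ``'alarm'``, while replacing the middle period with ``[60]`` gives
+``'ok'`` and replacing it with ``[]`` gives ``'insufficient data'``.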
+
+Composite rule alarms
+---------------------
+
+Composite alarms enable users to define an alarm with multiple triggering
+conditions, using a combination of ``and`` and ``or`` relations.
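+
+As a rough sketch of how such a rule tree could be resolved, the
+following Python function walks nested ``and``/``or`` relations. The
+function name, the use of plain strings as leaf rules, and the
+``leaf_states`` mapping are assumptions made for illustration only;
+they do not mirror Aodh internals.
+
+.. code-block:: python
+
+   def evaluate_composite(rule, leaf_states):
+       """Evaluate a nested {"and": [...]} / {"or": [...]} rule tree.
+
+       Leaves are hypothetical rule names looked up in ``leaf_states``,
+       a dict mapping each name to True when that rule is triggered.
+       """
+       if isinstance(rule, str):
+           return leaf_states[rule]
+       # Each non-leaf node is a dict with exactly one relation key.
+       (relation, children), = rule.items()
+       results = (evaluate_composite(child, leaf_states)
+                  for child in children)
+       if relation == "and":
+           return all(results)
+       if relation == "or":
+           return any(results)
+       raise ValueError("unknown relation: %r" % relation)
+
+Given ``{"a": True, "b": False, "c": True}`` as leaf states, the rule
+``{"or": ["b", {"and": ["a", "c"]}]}`` evaluates to ``True``.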
+
+
+Combination rule alarms
+-----------------------
+
+.. note::
+
+   Combination alarms are deprecated as of Newton in favor of composite
+   alarms. Combination alarm functionality is removed in Pike.
+
+The Telemetry service also supports the concept of a meta-alarm, which
+aggregates over the current state of a set of underlying basic alarms
+combined via a logical operator (``and`` or ``or``).
+
+Alarm dimensioning
+~~~~~~~~~~~~~~~~~~
+
+A key associated concept is the notion of *dimensioning*, which
+defines the set of matching meters that feed into an alarm
+evaluation. Recall that meters are per-resource-instance, so in the
+simplest case an alarm might be defined over a particular meter
+applied to all resources visible to a particular user. More useful,
+however, would be the option to explicitly select which specific
+resources you are interested in alarming on.
+
+At one extreme, you might have narrowly dimensioned alarms where this
+selection would have only a single target (identified by resource
+ID). At the other extreme, you could have widely dimensioned alarms
+where this selection identifies many resources over which the
+statistic is aggregated, for example all instances booted from a
+particular image or all instances with matching user metadata (the
+latter is how the Orchestration service identifies autoscaling
+groups).
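+
+The difference between narrow and wide dimensioning can be sketched as
+a selection step followed by an aggregation. The sample data, the
+``select`` and ``aggregate_cpu`` helpers, and the ``group`` metadata
+key below are all hypothetical and exist only to illustrate the idea.
+
+.. code-block:: python
+
+   from statistics import mean
+
+   # Hypothetical sample data: one recent cpu_util value per instance.
+   RESOURCES = [
+       {"id": "inst-1", "metadata": {"group": "web"}, "cpu_util": 82.0},
+       {"id": "inst-2", "metadata": {"group": "web"}, "cpu_util": 58.0},
+       {"id": "inst-3", "metadata": {"group": "db"}, "cpu_util": 35.0},
+   ]
+
+
+   def select(resources, resource_id=None, **metadata):
+       """Pick one resource by ID, or all resources matching metadata."""
+       if resource_id is not None:
+           # Narrow dimensioning: a single target.
+           return [r for r in resources if r["id"] == resource_id]
+       # Wide dimensioning: every resource whose metadata matches.
+       return [r for r in resources
+               if all(r["metadata"].get(key) == value
+                      for key, value in metadata.items())]
+
+
+   def aggregate_cpu(resources):
+       """Average the statistic over the selected resources."""
+       return mean(r["cpu_util"] for r in resources)
+
+Here ``select(RESOURCES, resource_id="inst-1")`` yields a single
+instance, whereas ``aggregate_cpu(select(RESOURCES, group="web"))``
+averages over both instances in the ``web`` group.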
+
+Alarm evaluation
+~~~~~~~~~~~~~~~~
+
+Alarms are evaluated by the ``alarm-evaluator`` service on a periodic
+basis, defaulting to once every minute.
+
+Alarm actions
+-------------
+
+Any state transition of an individual alarm (to ``ok``, ``alarm``, or
+``insufficient data``) may have one or more actions associated with
+it. These actions effectively send a signal to a consumer that the
+state transition has occurred, and provide some additional context.
+This includes the new and previous states, with some reason data
+describing the disposition with respect to the threshold, the number
+of datapoints involved, and the most recent of these. State transitions
+are detected by the ``alarm-evaluator``, whereas the
+``alarm-notifier`` effects the actual notification action.
+
+**Webhooks**
+
+These are the *de facto* notification type used by Telemetry alarming
+and simply involve an HTTP POST request being sent to an endpoint,
+with a request body containing a description of the state transition
+encoded as a JSON fragment.
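+
+A minimal sketch of a webhook consumer, using only the Python standard
+library, might look as follows. The payload keys read here
+(``alarm_name``, ``previous``, ``current``, ``reason``) are assumptions
+about the JSON body and should be checked against a real notification;
+``describe_transition``, ``AlarmHandler``, and ``serve`` are
+hypothetical names.
+
+.. code-block:: python
+
+   import json
+   from http.server import BaseHTTPRequestHandler, HTTPServer
+
+
+   def describe_transition(body):
+       """Summarise a webhook request body (bytes or str) as one line."""
+       payload = json.loads(body)
+       # Assumed keys; fall back to placeholders when they are absent.
+       return "{alarm}: {previous} -> {current} ({reason})".format(
+           alarm=payload.get("alarm_name", "<unknown>"),
+           previous=payload.get("previous", "<unknown>"),
+           current=payload.get("current", "<unknown>"),
+           reason=payload.get("reason", "no reason given"),
+       )
+
+
+   class AlarmHandler(BaseHTTPRequestHandler):
+       def do_POST(self):
+           length = int(self.headers.get("Content-Length", 0))
+           print(describe_transition(self.rfile.read(length)))
+           self.send_response(204)  # acknowledge without a body
+           self.end_headers()
+
+
+   def serve(port=8080):
+       """Block forever, printing each state transition received."""
+       HTTPServer(("127.0.0.1", port), AlarmHandler).serve_forever()
+
+Calling ``serve()`` starts a listener whose URL could then be passed as
+an alarm action.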
+
+**Log actions**
+
+These are a lightweight alternative to webhooks, whereby the state
+transition is simply logged by the ``alarm-notifier``, and are
+intended primarily for testing purposes.
+
+Workload partitioning
+---------------------
+
+The alarm evaluation process uses the same mechanism for workload
+partitioning as the central and compute agents. The
+`Tooz <https://pypi.python.org/pypi/tooz>`_ library provides the
+coordination within the groups of service instances. For further
+information about this approach, see the `high availability guide
+<https://docs.openstack.org/ha-guide/controller-ha-telemetry.html>`_.
+
+To use this workload partitioning solution, set the
+``evaluation_service`` option to ``default``. For more
+information, see the alarm section in the
+`OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_.
+
+Using alarms
+~~~~~~~~~~~~
+
+Alarm creation
+--------------
+
+An example of creating a Gnocchi threshold-oriented alarm, based on an upper
+bound on the CPU utilization for a particular instance:
+
+.. code-block:: console
+
+   $ aodh alarm create --name cpu_hi \
+     --type gnocchi_resources_threshold \
+     --description 'instance running hot' \
+     --metric cpu_util --threshold 70.0 \
+     --comparison-operator gt --aggregation-method mean \
+     --granularity 600 --evaluation-periods 3 \
+     --alarm-action 'log://' \
+     --resource-type instance --resource-id INSTANCE_ID
+
+This creates an alarm that will fire when the average CPU utilization
+for an individual instance exceeds 70% for three consecutive 10
+minute periods. The notification in this case is simply a log message,
+though it could alternatively be a webhook URL.
+
+.. note::
+
+    Alarm names must be unique for the alarms associated with an
+    individual project. Administrators can limit the maximum number
+    of resulting actions for each of the three states, and the
+    ability for a normal user to create ``log://`` and ``test://``
+    notifiers is disabled. This prevents unintentional
+    consumption of disk and memory resources by the
+    Telemetry service.
+
+The sliding time window over which the alarm is evaluated is 30
+minutes in this example (three periods of 600 seconds). This window
+is not clamped to wall-clock time boundaries. Instead, it is anchored
+on the current time for each evaluation cycle, and continually creeps
+forward as each evaluation cycle rolls around (by default, this occurs
+every minute).
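+
+The window arithmetic can be checked with a short Python sketch. The
+``evaluation_window`` helper is a hypothetical name, and the timestamp
+is arbitrary. The sketch only shows how the window length follows from
+the granularity and the number of evaluation periods.
+
+.. code-block:: python
+
+   from datetime import datetime, timedelta, timezone
+
+
+   def evaluation_window(now, granularity, evaluation_periods):
+       """Return (start, end) of the window anchored on ``now``."""
+       length = timedelta(seconds=granularity * evaluation_periods)
+       return now - length, now
+
+
+   now = datetime(2017, 7, 13, 17, 53, 26, tzinfo=timezone.utc)
+   start, end = evaluation_window(now, granularity=600,
+                                  evaluation_periods=3)
+   print(end - start)  # 0:30:00
+
+   # One evaluation cycle later, the whole window has moved by a minute.
+   later_start, _ = evaluation_window(now + timedelta(minutes=1), 600, 3)
+   print(later_start - start)  # 0:01:00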
+
+.. note::
+
+   The alarm granularity must match the granularities of the metric configured
+   in Gnocchi.
+
+Otherwise the alarm will tend to flit in and out of the
+``insufficient data`` state due to the mismatch between the actual
+frequency of datapoints in the metering store and the statistics
+queries used to compare against the alarm threshold. If a shorter
+alarm period is needed, then the corresponding interval should be
+adjusted in the ``pipeline.yaml`` file.
+
+Other notable alarm attributes that may be set on creation, or via a
+subsequent update, include:
+
+state
+  The initial alarm state (defaults to ``insufficient data``).
+
+description
+  A free-text description of the alarm (defaults to a synopsis of the
+  alarm rule).
+
+enabled
+  True if evaluation and actioning is to be enabled for this alarm
+  (defaults to ``True``).
+
+repeat-actions
+  True if actions should be repeatedly notified while the alarm
+  remains in the target state (defaults to ``False``).
+
+ok-action
+  An action to invoke when the alarm state transitions to ``ok``.
+
+insufficient-data-action
+  An action to invoke when the alarm state transitions to
+  ``insufficient data``.
+
+time-constraint
+  Used to restrict evaluation of the alarm to certain times of the
+  day or days of the week (expressed as a ``cron`` expression with an
+  optional timezone).
+
+An example of creating a composite alarm, based on the combined
+state of two underlying threshold rules:
+
+.. code-block:: console
+
+   $ aodh alarm create --name meta --type composite \
+     --composite-rule '{"or":[{"threshold": 0.8, "metric": "cpu_util",
+     "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID",
+     "aggregation-method": "last"}, {"threshold": 0.8, "metric": "cpu_util",
+     "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID2",
+     "aggregation-method": "last"}]}' \
+     --alarm-action 'http://example.org/notify'
+
+This creates an alarm that will fire when either one of the two
+underlying rules evaluates as ``True``. The notification in this case
+is a webhook call. Any number of underlying rules can be combined in
+this way, using either ``and`` or ``or``. Additionally, composite rules
+can contain nested conditions:
+
+.. code-block:: console
+
+   $ aodh alarm create --name meta --type composite \
+     --composite-rule '{"or":[ALARM_1, {"and":[ALARM_2, ALARM_3]}]}' \
+     --alarm-action 'http://example.org/notify'
+
+
+Alarm retrieval
+---------------
+
+You can display all your alarms via (some attributes are omitted for
+brevity):
+
+.. code-block:: console
+
+   $ aodh alarm list
+   +----------+-----------+--------+-------------------+----------+---------+
+   | Alarm ID | Type      | Name   | State             | Severity | Enabled |
+   +----------+-----------+--------+-------------------+----------+---------+
+   | ALARM_ID | threshold | cpu_hi | insufficient data | high     | True    |
+   +----------+-----------+--------+-------------------+----------+---------+
+
+In this case, the state is reported as ``insufficient data`` which
+could indicate that:
+
+* meters have not yet been gathered about this instance over the
+  evaluation window into the recent past (for example a brand-new
+  instance)
+
+* *or*, that the identified instance is not visible to the
+  user/project owning the alarm
+
+* *or*, simply that an alarm evaluation cycle hasn't kicked off since
+  the alarm was created (by default, alarms are evaluated once per
+  minute).
+
+.. note::
+
+   The visibility of alarms depends on the role and project
+   associated with the user issuing the query:
+
+   * admin users see *all* alarms, regardless of the owner
+
+   * non-admin users see only the alarms associated with their project
+     (as per the normal project segregation in OpenStack)
+
+Alarm update
+------------
+
+Once the state of the alarm has settled down, we might decide that we
+set the bar too low at 70%, in which case the threshold (or almost
+any other alarm attribute) can be updated as follows:
+
+.. code-block:: console
+
+   $ aodh alarm update ALARM_ID --threshold 75
+
+The change will take effect from the next evaluation cycle, which by
+default occurs every minute.
+
+Most alarm attributes can be changed in this way, but there is also
+a convenient short-cut for getting and setting the alarm state:
+
+.. code-block:: console
+
+   $ openstack alarm state get ALARM_ID
+   $ openstack alarm state set --state ok ALARM_ID
+
+Over time the state of the alarm may change often, especially if the
+threshold is chosen to be close to the trending value of the
+statistic. You can follow the history of an alarm over its lifecycle
+via the audit API:
+
+.. code-block:: console
+
+   $ aodh alarm-history show ALARM_ID
+   +------------------+-----------+---------------------------------------+
+   | Type             | Timestamp | Detail                                |
+   +------------------+-----------+---------------------------------------+
+   | creation         | time0     | name: cpu_hi                          |
+   |                  |           | description: instance running hot     |
+   |                  |           | type: threshold                       |
+   |                  |           | rule: cpu_util > 70.0 during 3 x 600s |
+   | state transition | time1     | state: ok                             |
+   | rule change      | time2     | rule: cpu_util > 75.0 during 3 x 600s |
+   +------------------+-----------+---------------------------------------+
+
+Alarm deletion
+--------------
+
+An alarm that is no longer required can be disabled so that it is no
+longer actively evaluated:
+
+.. code-block:: console
+
+   $ aodh alarm update --enabled False ALARM_ID
+
+or even deleted permanently (an irreversible step):
+
+.. code-block:: console
+
+   $ aodh alarm delete ALARM_ID
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi.

   install/index
   contributor/index
+   admin/index

 .. toctree::
   :maxdepth: 1