Merge "copy admin-guide"
commit
1b78978cba
7
doc/source/admin/index.rst
Normal file
@ -0,0 +1,7 @@
|
||||
==========================
|
||||
Telemetry Alarming service
|
||||
==========================
|
||||
|
||||
.. toctree::
|
||||
|
||||
telemetry-alarms.rst
|
343
doc/source/admin/telemetry-alarms.rst
Normal file
@ -0,0 +1,343 @@
|
||||
.. _telemetry-alarms:
|
||||
|
||||
======
|
||||
Alarms
|
||||
======
|
||||
|
||||
Alarms provide user-oriented Monitoring-as-a-Service for resources
|
||||
running on OpenStack. This type of monitoring ensures you can
|
||||
automatically scale in or out a group of instances through the
|
||||
Orchestration service, but you can also use alarms for general-purpose
|
||||
awareness of your cloud resources' health.
|
||||
|
||||
These alarms follow a tri-state model:
|
||||
|
||||
ok
  The rule governing the alarm has been evaluated as ``False``.

alarm
  The rule governing the alarm has been evaluated as ``True``.

insufficient data
  There are not enough datapoints available in the evaluation periods
  to meaningfully determine the alarm state.
|
||||
|
||||
Alarm definitions
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The definition of an alarm provides the rules that govern when a state
|
||||
transition should occur, and the actions to be taken thereon. The
|
||||
nature of these rules depends on the alarm type.
|
||||
|
||||
Threshold rule alarms
|
||||
---------------------
|
||||
|
||||
For conventional threshold-oriented alarms, state transitions are
|
||||
governed by:
|
||||
|
||||
* A static threshold value with a comparison operator such as greater
  than or less than.

* A statistic selection to aggregate the data.

* A sliding time window to indicate how far back into the recent past
  you want to look.
|
||||
|
||||
Valid threshold alarm types are ``gnocchi_resources_threshold``,
``gnocchi_aggregation_by_metrics_threshold``, and
``gnocchi_aggregation_by_resources_threshold``.
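
The three ingredients listed above (threshold with comparison operator, aggregated statistic, and sliding window) can be illustrated with a small Python sketch. This is a simplified model and not the Aodh evaluator itself: the function name ``evaluate_threshold`` and its parameters are hypothetical, the per-period aggregates are assumed to be computed already, and a window with mixed results is simply reported as ``ok`` here, which the real service may handle differently.

```python
import operator
from typing import Optional, Sequence

# Comparison operators keyed by short names such as "gt" and "lt".
COMPARATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
}


def evaluate_threshold(
    aggregates: Sequence[Optional[float]],
    threshold: float,
    comparison_operator: str = "gt",
    evaluation_periods: int = 1,
) -> str:
    """Return 'ok', 'alarm' or 'insufficient data' for one evaluation cycle.

    ``aggregates`` holds one aggregated value per period, oldest first;
    ``None`` marks a period for which no datapoints were available.
    """
    # Only the most recent ``evaluation_periods`` periods are considered.
    window = [v for v in aggregates[-evaluation_periods:] if v is not None]
    if len(window) < evaluation_periods:
        return "insufficient data"
    compare = COMPARATORS[comparison_operator]
    # The rule is True only if every period in the window breaches the threshold.
    if all(compare(value, threshold) for value in window):
        return "alarm"
    return "ok"


if __name__ == "__main__":
    print(evaluate_threshold([80.0, 75.0, 90.0], 70.0, "gt", 3))
    print(evaluate_threshold([80.0, None, 90.0], 70.0, "gt", 3))
```

With three periods above 70.0 the sketch reports ``alarm``; a missing period yields ``insufficient data``.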
|
||||
|
||||
.. note::

   As of Ocata, the ``threshold`` alarm is deprecated since Ceilometer's
   native storage API is deprecated.
|
||||
|
||||
Composite rule alarms
|
||||
---------------------
|
||||
|
||||
Composite alarms enable users to define an alarm with multiple triggering
|
||||
conditions, using a combination of ``and`` and ``or`` relations.
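
How nested ``and`` and ``or`` relations combine can be sketched in a few lines of Python. This is an illustration only: in this sketch each leaf is a hypothetical condition name looked up in a dictionary of already-evaluated results, whereas a real composite rule carries full threshold rule definitions. The function name ``evaluate_composite`` is an assumption, not part of any Aodh API.

```python
from typing import Any, Dict, Union

# A rule is either a leaf (a condition name) or a one-key dict
# {"and": [...]} / {"or": [...]} whose value is a list of sub-rules.
Rule = Union[str, Dict[str, Any]]


def evaluate_composite(rule: Rule, fired: Dict[str, bool]) -> bool:
    """Evaluate a nested and/or rule against per-condition results."""
    if isinstance(rule, str):
        return fired[rule]
    if len(rule) != 1:
        raise ValueError("a composite node needs exactly one key")
    (relation, sub_rules), = rule.items()
    results = (evaluate_composite(sub, fired) for sub in sub_rules)
    if relation == "and":
        return all(results)
    if relation == "or":
        return any(results)
    raise ValueError(f"unknown relation: {relation!r}")


if __name__ == "__main__":
    rule = {"or": ["cpu_hi", {"and": ["mem_hi", "disk_hi"]}]}
    states = {"cpu_hi": False, "mem_hi": True, "disk_hi": True}
    print(evaluate_composite(rule, states))
```

The recursion mirrors the nesting of the rule, so arbitrarily deep combinations are handled by the same two branches.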
|
||||
|
||||
|
||||
Combination rule alarms
|
||||
-----------------------
|
||||
|
||||
.. note::

   Combination alarms are deprecated as of Newton in favor of composite
   alarms. Combination alarm functionality is removed in Pike.
|
||||
|
||||
The Telemetry service also supports the concept of a meta-alarm, which
|
||||
aggregates over the current state of a set of underlying basic alarms
|
||||
combined via a logical operator (``and`` or ``or``).
|
||||
|
||||
Alarm dimensioning
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A key associated concept is the notion of *dimensioning*, which
defines the set of matching meters that feed into an alarm
evaluation. Recall that meters are per-resource-instance, so in the
simplest case an alarm might be defined over a particular meter
applied to all resources visible to a particular user. More useful,
however, is the option to explicitly select which specific
resources you are interested in alarming on.
|
||||
|
||||
At one extreme you might have narrowly dimensioned alarms where this
|
||||
selection would have only a single target (identified by resource
|
||||
ID). At the other extreme, you could have widely dimensioned alarms
|
||||
where this selection identifies many resources over which the
|
||||
statistic is aggregated, for example all instances booted from a
|
||||
particular image or all instances with matching user metadata (the
|
||||
latter is how the Orchestration service identifies autoscaling
|
||||
groups).
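
The difference between narrow and wide dimensioning can be pictured as a filter over resources. The sketch below works on hypothetical in-memory records with ``id`` and ``metadata`` keys; it is not a Gnocchi or Aodh query, and ``select_resources`` is an illustrative name.

```python
from typing import Any, Dict, Iterable, List, Optional

Resource = Dict[str, Any]


def select_resources(
    resources: Iterable[Resource],
    resource_id: Optional[str] = None,
    metadata: Optional[Dict[str, str]] = None,
) -> List[Resource]:
    """Return the resources an alarm would aggregate over.

    Passing ``resource_id`` gives a narrowly dimensioned selection;
    passing ``metadata`` selects every resource whose metadata matches.
    """
    wanted = metadata or {}
    selected = []
    for resource in resources:
        if resource_id is not None and resource["id"] != resource_id:
            continue
        actual = resource.get("metadata", {})
        if any(actual.get(key) != value for key, value in wanted.items()):
            continue
        selected.append(resource)
    return selected


if __name__ == "__main__":
    fleet = [
        {"id": "a", "metadata": {"group": "web"}},
        {"id": "b", "metadata": {"group": "web"}},
        {"id": "c", "metadata": {"group": "db"}},
    ]
    print([r["id"] for r in select_resources(fleet, resource_id="a")])
    print([r["id"] for r in select_resources(fleet, metadata={"group": "web"})])
```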
|
||||
|
||||
Alarm evaluation
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
Alarms are evaluated by the ``alarm-evaluator`` service on a periodic
|
||||
basis, defaulting to once every minute.
|
||||
|
||||
Alarm actions
|
||||
-------------
|
||||
|
||||
Any state transition of an individual alarm (to ``ok``, ``alarm``, or
``insufficient data``) may have one or more actions associated with
it. These actions effectively send a signal to a consumer that the
state transition has occurred, and provide some additional context.
This includes the new and previous states, with some reason data
describing the disposition with respect to the threshold, the number
of datapoints involved, and the most recent of these. State transitions
are detected by the ``alarm-evaluator``, whereas the
``alarm-notifier`` effects the actual notification action.
|
||||
|
||||
**Webhooks**
|
||||
|
||||
These are the *de facto* notification type used by Telemetry alarming
|
||||
and simply involve an HTTP POST request being sent to an endpoint,
|
||||
with a request body containing a description of the state transition
|
||||
encoded as a JSON fragment.
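
On the receiving side, a webhook consumer only needs to decode the JSON body and act on the transition. The following Python sketch shows one way to do that. The field names ``alarm_name``, ``previous``, ``current``, and ``reason`` are assumptions made for illustration; check an actual request body from your deployment before relying on them.

```python
import json


def summarize_transition(body: bytes) -> str:
    """Turn a webhook request body into a one-line summary."""
    payload = json.loads(body)
    # The keys below are assumed for this sketch; .get() keeps the
    # function tolerant of payloads that omit them.
    name = payload.get("alarm_name", "<unknown>")
    previous = payload.get("previous", "<unknown>")
    current = payload.get("current", "<unknown>")
    reason = payload.get("reason", "")
    return f"{name}: {previous} -> {current} ({reason})"


if __name__ == "__main__":
    sample = json.dumps(
        {
            "alarm_name": "cpu_hi",
            "previous": "ok",
            "current": "alarm",
            "reason": "threshold exceeded",
        }
    ).encode("utf-8")
    print(summarize_transition(sample))
```

A real consumer would call a function like this from its HTTP POST handler and then trigger whatever remediation is appropriate.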
|
||||
|
||||
**Log actions**
|
||||
|
||||
These are a lightweight alternative to webhooks, whereby the state
|
||||
transition is simply logged by the ``alarm-notifier``, and are
|
||||
intended primarily for testing purposes.
|
||||
|
||||
Workload partitioning
|
||||
---------------------
|
||||
|
||||
The alarm evaluation process uses the same mechanism for workload
|
||||
partitioning as the central and compute agents. The
|
||||
`Tooz <https://pypi.python.org/pypi/tooz>`_ library provides the
|
||||
coordination within the groups of service instances. For further
|
||||
information about this approach, see the `high availability guide
|
||||
<https://docs.openstack.org/ha-guide/controller-ha-telemetry.html>`_.
|
||||
|
||||
To use this workload partitioning solution, set the
|
||||
``evaluation_service`` option to ``default``. For more
|
||||
information, see the alarm section in the
|
||||
`OpenStack Configuration Reference <https://docs.openstack.org/ocata/config-reference/telemetry.html>`_.
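
The idea behind workload partitioning is that each evaluator instance takes responsibility for a disjoint share of the alarms. The Python sketch below uses a plain hash-modulo assignment to show the principle. It is deliberately simpler than the coordination Tooz provides and does not reproduce its algorithm; ``assign_alarms`` and the evaluator names are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence


def assign_alarms(
    alarm_ids: Sequence[str], evaluators: Sequence[str]
) -> Dict[str, List[str]]:
    """Split alarm IDs among evaluators with a stable hash-modulo rule."""
    members = sorted(evaluators)
    if not members:
        raise ValueError("at least one evaluator is required")
    assignment: Dict[str, List[str]] = {member: [] for member in members}
    for alarm_id in alarm_ids:
        # A cryptographic digest gives the same result on every host,
        # unlike Python's built-in hash(), which is salted per process.
        digest = hashlib.sha256(alarm_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(members)
        assignment[members[index]].append(alarm_id)
    return assignment


if __name__ == "__main__":
    alarms = [f"alarm-{i}" for i in range(6)]
    for evaluator, owned in assign_alarms(alarms, ["eval-a", "eval-b"]).items():
        print(evaluator, owned)
```

Every alarm lands on exactly one evaluator, and all instances compute the same split as long as they agree on group membership. A plain modulo scheme reshuffles many alarms when membership changes, which is one reason production systems prefer more elaborate schemes.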
|
||||
|
||||
Using alarms
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Alarm creation
|
||||
--------------
|
||||
|
||||
An example of creating a Gnocchi threshold-oriented alarm, based on an upper
|
||||
bound on the CPU utilization for a particular instance:
|
||||
|
||||
.. code-block:: console

   $ aodh alarm create --name cpu_hi \
     --type gnocchi_resources_threshold \
     --description 'instance running hot' \
     --metric cpu_util --threshold 70.0 \
     --comparison-operator gt --aggregation-method mean \
     --granularity 600 --evaluation-periods 3 \
     --alarm-action 'log://' \
     --resource-type instance --resource-id INSTANCE_ID
|
||||
|
||||
This creates an alarm that will fire when the average CPU utilization
|
||||
for an individual instance exceeds 70% for three consecutive 10
|
||||
minute periods. The notification in this case is simply a log message,
|
||||
though it could alternatively be a webhook URL.
|
||||
|
||||
.. note::

   Alarm names must be unique for the alarms associated with an
   individual project. An administrator can limit the maximum number
   of resulting actions for each of the three states, and the
   ability for a normal user to create ``log://`` and ``test://``
   notifiers is disabled. This prevents unintentional
   consumption of disk and memory resources by the
   Telemetry service.
|
||||
|
||||
The sliding time window over which the alarm is evaluated is 30
|
||||
minutes in this example. This window is not clamped to wall-clock
time boundaries; rather, it is anchored on the current time for each
|
||||
evaluation cycle, and continually creeps forward as each evaluation
|
||||
cycle rolls around (by default, this occurs every minute).
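
The arithmetic behind the sliding window is short enough to write out. The Python sketch below assumes the window is simply ``granularity * evaluation_periods`` seconds ending at the current time; the real evaluator may add margins for late datapoints, and ``evaluation_window`` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def evaluation_window(
    now: datetime, granularity: int, evaluation_periods: int
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the window examined in one evaluation cycle.

    The window is anchored on ``now`` instead of being aligned to
    wall-clock boundaries, so it creeps forward with every cycle.
    """
    length = timedelta(seconds=granularity * evaluation_periods)
    return now - length, now


if __name__ == "__main__":
    now = datetime(2017, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
    start, end = evaluation_window(now, granularity=600, evaluation_periods=3)
    print(start.isoformat(), end.isoformat())
```

With a granularity of 600 seconds and three evaluation periods, the window spans 30 minutes, matching the example above.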
|
||||
|
||||
.. note::

   The alarm granularity must match the granularities of the metric configured
   in Gnocchi.

   Otherwise the alarm will tend to flit in and out of the
   ``insufficient data`` state due to the mismatch between the actual
   frequency of datapoints in the metering store and the statistics
   queries used to compare against the alarm threshold. If a shorter
   alarm period is needed, then the corresponding interval should be
   adjusted in the ``pipeline.yaml`` file.
|
||||
|
||||
Other notable alarm attributes that may be set on creation, or via a
|
||||
subsequent update, include:
|
||||
|
||||
state
  The initial alarm state (defaults to ``insufficient data``).

description
  A free-text description of the alarm (defaults to a synopsis of the
  alarm rule).

enabled
  True if evaluation and actioning are to be enabled for this alarm
  (defaults to ``True``).

repeat-actions
  True if actions should be repeatedly notified while the alarm
  remains in the target state (defaults to ``False``).

ok-action
  An action to invoke when the alarm state transitions to ``ok``.

insufficient-data-action
  An action to invoke when the alarm state transitions to
  ``insufficient data``.

time-constraint
  Used to restrict evaluation of the alarm to certain times of the
  day or days of the week (expressed as a ``cron`` expression with an
  optional timezone).
|
||||
|
||||
An example of creating a composite alarm, based on the combined
state of two underlying threshold rules:
|
||||
|
||||
.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or": [
       {"threshold": 0.8, "metric": "cpu_util",
        "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID",
        "resource_type": "instance", "aggregation_method": "last"},
       {"threshold": 0.8, "metric": "cpu_util",
        "type": "gnocchi_resources_threshold", "resource_id": "INSTANCE_ID2",
        "resource_type": "instance", "aggregation_method": "last"}]}' \
     --alarm-action 'http://example.org/notify'
|
||||
|
||||
This creates an alarm that will fire when either of the two underlying
rules evaluates to true. The notification in this case
is a webhook call. Any number of underlying rules can be combined in
this way, using either ``and`` or ``or``. Additionally, composite rules
can contain nested conditions:
|
||||
|
||||
.. code-block:: console

   $ aodh alarm create --name meta --type composite \
     --composite-rule '{"or": [ALARM_1, {"and": [ALARM_2, ALARM_3]}]}' \
     --alarm-action 'http://example.org/notify'
|
||||
|
||||
|
||||
Alarm retrieval
|
||||
---------------
|
||||
|
||||
You can display all your alarms via (some attributes are omitted for
|
||||
brevity):
|
||||
|
||||
.. code-block:: console

   $ aodh alarm list
   +----------+-----------+--------+-------------------+----------+---------+
   | Alarm ID | Type      | Name   | State             | Severity | Enabled |
   +----------+-----------+--------+-------------------+----------+---------+
   | ALARM_ID | threshold | cpu_hi | insufficient data | high     | True    |
   +----------+-----------+--------+-------------------+----------+---------+
|
||||
|
||||
In this case, the state is reported as ``insufficient data`` which
|
||||
could indicate that:
|
||||
|
||||
* meters have not yet been gathered about this instance over the
|
||||
evaluation window into the recent past (for example a brand-new
|
||||
instance)
|
||||
|
||||
* *or*, that the identified instance is not visible to the
|
||||
user/project owning the alarm
|
||||
|
||||
* *or*, simply that an alarm evaluation cycle hasn't kicked off since
|
||||
the alarm was created (by default, alarms are evaluated once per
|
||||
minute).
|
||||
|
||||
.. note::
|
||||
|
||||
The visibility of alarms depends on the role and project
|
||||
associated with the user issuing the query:
|
||||
|
||||
* admin users see *all* alarms, regardless of the owner
|
||||
|
||||
* non-admin users see only the alarms associated with their project
|
||||
(as per the normal project segregation in OpenStack)
|
||||
|
||||
Alarm update
|
||||
------------
|
||||
|
||||
Once the state of the alarm has settled down, we might decide that the
70% bar was set too low, in which case the threshold (or almost
any other alarm attribute) can be updated as follows:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ aodh alarm update ALARM_ID --threshold 75
|
||||
|
||||
The change will take effect from the next evaluation cycle, which by
|
||||
default occurs every minute.
|
||||
|
||||
Most alarm attributes can be changed in this way, but there is also
|
||||
a convenient short-cut for getting and setting the alarm state:
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ openstack alarm state get ALARM_ID
|
||||
$ openstack alarm state set --state ok ALARM_ID
|
||||
|
||||
Over time the state of the alarm may change often, especially if the
|
||||
threshold is chosen to be close to the trending value of the
|
||||
statistic. You can follow the history of an alarm over its lifecycle
|
||||
via the audit API:
|
||||
|
||||
.. code-block:: console

   $ aodh alarm-history show ALARM_ID
   +------------------+-----------+---------------------------------------+
   | Type             | Timestamp | Detail                                |
   +------------------+-----------+---------------------------------------+
   | creation         | time0     | name: cpu_hi                          |
   |                  |           | description: instance running hot     |
   |                  |           | type: threshold                       |
   |                  |           | rule: cpu_util > 70.0 during 3 x 600s |
   | state transition | time1     | state: ok                             |
   | rule change      | time2     | rule: cpu_util > 75.0 during 3 x 600s |
   +------------------+-----------+---------------------------------------+
|
||||
|
||||
Alarm deletion
|
||||
--------------
|
||||
|
||||
An alarm that is no longer required can be disabled so that it is no
|
||||
longer actively evaluated:
|
||||
|
||||
.. code-block:: console

   $ aodh alarm update --enabled False ALARM_ID
|
||||
|
||||
or even deleted permanently (an irreversible step):
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ aodh alarm delete ALARM_ID
|
@ -27,6 +27,7 @@ collected by Ceilometer or Gnocchi.
|
||||
|
||||
install/index
|
||||
contributor/index
|
||||
admin/index
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|