Ifat Afek e8abce727c Move Stein specs from approved to implemented

Change-Id: I96df2ebbd9c91a0364888af95a94fccb8c138ae0

2019-02-27 17:06:00 +00:00


Prometheus Datasource Labels Mapping

StoryBoard link (task #28682): https://storyboard.openstack.org/#!/story/2004682

This blueprint describes the method of mapping Prometheus's alert labels into Vitrage, in a way that will allow Vitrage to identify the resource that the alarm was raised on.

Problem description

Vitrage holds an entity graph with resources and alarms. Each alarm should be connected to its resource in the graph, in order to support alarm correlation. In Prometheus, an alert is based on a metric, and each metric can carry different labels. The labels contain enough information to identify the resource; however, the label or labels that Vitrage needs to use may differ from one alert to another.

The purpose of this blueprint is to define a way for Vitrage to easily determine the identification method for every Prometheus alert.

In the use cases described below, we would like to create a Prometheus alarm vertex in Vitrage and connect it with an "on" edge to the right resource in the graph.

Prometheus alert structure

A Prometheus alert contains several fields, including:

annotations - annotations including title and description of the alert.
labels - a list of one or more labels. The labels are generated from the alert rule and from the metrics that the alert is based on.
status - the current status of the alert: firing or resolved.

An example of a Prometheus alert:

{
    "annotations": {
        "description": "The average amount of CPU time spent in idle mode, per second, over the last minute (in seconds)",
        "title": "High average CPU time on idle mode"
    },
    "endsAt": "2018-12-30T09:43:52.589431274Z",
    "generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22node%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29%29+%3E+20&g0.tab=1",
    "labels": {
        "alertname": "AvgCPUTimeOnIdleMode",
        "instance": "135.248.18.109:9100",
        "severity": "warning"
    },
    "receivers": [
        "vitrage"
    ],
    "startsAt": "2018-12-26T15:22:07.589431274Z",
    "status": "firing"
}

A full description of Prometheus alert structure can be found in prometheus_alert_description
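
As a minimal sketch of how a consumer might read these fields, the following Python snippet parses a trimmed copy of the alert above with the standard json module. The helper name summarize_alert is hypothetical and is not part of Vitrage or Prometheus; it only illustrates where the labels and the status live in the payload.

```python
import json

# Trimmed copy of the example alert above; only the fields discussed here are kept.
RAW_ALERT = """
{
    "annotations": {"title": "High average CPU time on idle mode"},
    "labels": {
        "alertname": "AvgCPUTimeOnIdleMode",
        "instance": "135.248.18.109:9100",
        "severity": "warning"
    },
    "status": "firing"
}
"""


def summarize_alert(raw):
    """Hypothetical helper: extract the fields that matter for resource lookup."""
    alert = json.loads(raw)
    return {
        "name": alert["labels"]["alertname"],
        "labels": alert["labels"],
        # The status field is either "firing" or "resolved".
        "firing": alert["status"] == "firing",
        "title": alert.get("annotations", {}).get("title"),
    }


summary = summarize_alert(RAW_ALERT)
print(summary["name"], summary["firing"])
```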

Alerts based on libvirt metrics

All libvirt alerts have the following labels:

instance: holds the hostname or IP address of the host where the exporter is running
domain: the name of the libvirt domain that the exporter is scraping

Alert based on libvirt CPU metrics

Prometheus alert example

{
    "annotations": {
        "description": "Test alert to test libvirt exporter.\n",
        "title": "High cpu usage on vm"
    },
    "endsAt": "2018-12-30T09:44:05.91446215Z",
    "generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=rate%28libvirt_domain_info_cpu_time_seconds_total%5B1m%5D%29+%2A+10000+%3E+13&g0.tab=1",
    "labels": {
        "alertname": "HighCpuOnVmAlert",
        "domain": "instance-00000004",
        "instance": "135.248.18.109:9177",
        "job": "libvirt",
        "severity": "critical"
    },
    "receivers": [
        "vitrage"
    ],
    "startsAt": "2018-12-26T15:23:05.91446215Z",
    "status": "firing"
}

Vitrage resource

The Vitrage resource can be uniquely identified by the instance and domain labels.

Alert based on libvirt network metrics

Prometheus alert example

{
    "annotations": {
        "description": "Another test alert to test libvirt exporter.\n",
        "title": "High traffic on bridge"
    },
    "endsAt": "2018-12-30T09:43:50.91446215Z",
    "generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=rate%28libvirt_domain_interface_stats_receive_bytes_total%5B5m%5D%29+%3E+0&g0.tab=1",
    "labels": {
        "alertname": "HighTrafficOnBridge",
        "domain": "instance-00000004",
        "instance": "135.248.18.109:9177",
        "job": "libvirt",
        "severity": "critical",
        "source_bridge": "br-int",
        "target_device": "tap456ab233-f4"
    },
    "receivers": [
        "vitrage"
    ],
    "startsAt": "2018-12-26T15:22:05.91446215Z",
    "status": "firing"
}

Vitrage resource

Short term: raise the alarm on the node or instance. The Vitrage resource can be uniquely identified by the instance and domain labels.
Long term: Vitrage should hold a resource for br-int and the alarm should be connected to that resource. The Vitrage resource can be uniquely identified by the instance, domain, source_bridge and target_device labels.
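
The difference between the two options can be sketched in Python as a choice of which labels form the resource identity. This is only an illustration under stated assumptions: the helper identity_from_labels and the two constants are hypothetical names, and the label values are copied from the alert example above.

```python
# Label sets taken from the short-term and long-term descriptions above.
SHORT_TERM_LABELS = ("instance", "domain")
LONG_TERM_LABELS = ("instance", "domain", "source_bridge", "target_device")


def identity_from_labels(labels, required):
    """Hypothetical helper: return the identifying label values in order,
    or None if any required label is missing from the alert."""
    if any(name not in labels for name in required):
        return None
    return tuple(labels[name] for name in required)


# Labels copied from the HighTrafficOnBridge alert example.
labels = {
    "alertname": "HighTrafficOnBridge",
    "domain": "instance-00000004",
    "instance": "135.248.18.109:9177",
    "job": "libvirt",
    "severity": "critical",
    "source_bridge": "br-int",
    "target_device": "tap456ab233-f4",
}

short_term = identity_from_labels(labels, SHORT_TERM_LABELS)
long_term = identity_from_labels(labels, LONG_TERM_LABELS)
print(short_term)
print(long_term)
```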

Node metrics

Prometheus alert

All node metrics have an instance label that holds the address of the exporter. The exporter can scrape metrics from the instance it is running on; in this case the instance label represents the resource address. It can also scrape metrics that do not come from that instance (e.g. network metrics); in this case instance is just the address of the exporter, and other labels identify the resource.

Alert based on node CPU metric

Prometheus alert

CPU metrics are scraped from the instance, so the instance label represents the resource address.

Prometheus alert example

{
    "annotations": {
        "description": "The average amount of CPU time spent in idle mode, per second, over the last minute (in seconds)",
        "title": "High average CPU time on idle mode"
    },
    "endsAt": "2018-12-30T09:43:52.589431274Z",
    "generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22node%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29%29+%3E+20&g0.tab=1",
    "labels": {
        "alertname": "AvgCPUTimeOnIdleMode",
        "instance": "135.248.18.109:9100",
        "severity": "warning"
    },
    "receivers": [
        "vitrage"
    ],
    "startsAt": "2018-12-26T15:22:07.589431274Z",
    "status": "firing"
}

Vitrage resource

The Vitrage resource can be uniquely identified by the instance label.

Alert based on node filesystem metric

Prometheus alert example

{
    "annotations": {
        "description": "\"Consider ssh'ing into the instance and removing files or clean\ntemp files\"\n",
        "device": "/dev/vda1",
        "mount_point": "/",
        "runbook": "troubleshooting/filesystem_alerts_inodes.md",
        "title": "High number of inode usage",
        "value": "92.42%"
    },
    "endsAt": "2018-12-30T09:43:52.589431274Z",
    "generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=node_filesystem_files_free%7Bfstype%3D~%22%28ext.%7Cxfs%29%22%2Cjob%3D%22node%22%7D+%2F+node_filesystem_files%7Bfstype%3D~%22%28ext.%7Cxfs%29%22%2Cjob%3D%22node%22%7D+%2A+100+%3C%3D+100&g0.tab=1",
    "labels": {
        "alertname": "HighInodeUsage",
        "device": "/dev/vda1",
        "fstype": "ext4",
        "instance": "135.248.18.109:9100",
        "job": "node",
        "mountpoint": "/",
        "severity": "critical"
    },
    "receivers": [
        "vitrage"
    ],
    "startsAt": "2018-12-26T15:22:07.589431274Z",
    "status": "firing"
}

Vitrage resource

Short term: raise the alarm on the node or instance. The Vitrage resource can be uniquely identified by the instance label.
Long term: Vitrage should hold a resource for ext4 and the alarm should be connected to that resource. The Vitrage resource can be uniquely identified by the instance, device and fstype labels.

Proposed change

A configuration file that maps the Prometheus labels to a corresponding Vitrage resource with specific properties (id or other unique properties). The mapping will most likely be defined by the alert name and other fields.

Prometheus configuration file structure

The configuration file contains a list of alerts. Each alert contains a key and a resource. The key contains the labels that uniquely identify each alert. The resource specifies how to identify in Vitrage the resource that the alert is raised on. It contains one or more Vitrage property names and the corresponding Prometheus alert labels.

Configuration file example

alerts:
- key:
    alertname: HighCpuOnVmAlert
    job: libvirt
  resource:
    instance_name: domain
    host_id: instance
- key:
    alertname: HighTrafficOnBridge
    job: libvirt
  resource:
    instance_name: domain
    host_id: instance
- key:
    alertname: AvgCPUTimeOnIdleMode
  resource:
    id: instance
- key:
    alertname: HighInodeUsage
    job: node
  resource:
    id: instance
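
To illustrate how such a mapping could be applied, here is a minimal Python sketch, not the actual Vitrage transformer code. It assumes the file has already been parsed into plain dictionaries (two entries of the example above are reproduced by hand to avoid a YAML dependency), and the function name resolve_resource is hypothetical. The sketch returns the raw label values; any further normalization, such as handling the port in the instance label, is outside what this blueprint specifies.

```python
# Python equivalent of two entries from the configuration example above.
# In practice the file would be parsed from YAML; plain dicts keep this
# sketch free of third-party dependencies.
MAPPING = {
    "alerts": [
        {
            "key": {"alertname": "HighCpuOnVmAlert", "job": "libvirt"},
            "resource": {"instance_name": "domain", "host_id": "instance"},
        },
        {
            "key": {"alertname": "AvgCPUTimeOnIdleMode"},
            "resource": {"id": "instance"},
        },
    ]
}


def resolve_resource(labels, mapping):
    """Hypothetical helper: map alert labels to Vitrage resource properties.

    Returns a dict of Vitrage property name -> label value for the first
    entry whose key labels all match and whose resource labels are all
    present in the alert, or None when no entry qualifies.
    """
    for entry in mapping["alerts"]:
        key_matches = all(labels.get(k) == v for k, v in entry["key"].items())
        resource = entry["resource"]
        if key_matches and all(label in labels for label in resource.values()):
            return {prop: labels[label] for prop, label in resource.items()}
    return None


# Labels copied from the HighCpuOnVmAlert example.
libvirt_labels = {
    "alertname": "HighCpuOnVmAlert",
    "domain": "instance-00000004",
    "instance": "135.248.18.109:9177",
    "job": "libvirt",
    "severity": "critical",
}
print(resolve_resource(libvirt_labels, MAPPING))
```

Keeping the key separate from the resource means that alerts sharing a name but coming from different jobs can be mapped to different Vitrage properties without changing the lookup logic.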

Alternatives

None

Data model impact

None

REST API impact

None

Versioning impact

None

Other end user impact

None

Deployer impact

TBD

Developer impact

None

Horizon impact

None

Implementation

Assignee(s)

Primary assignee: 7mode3294 (Muhamad Najjar)

Work Items

Load configuration file and use it in the Prometheus transformer.
Documentation and tests.

Dependencies

None

Testing

Unit tests, functional tests and tempest tests

Documentation Impact

The new configuration will be documented

References

Prometheus datasource: https://github.com/openstack/vitrage/tree/master/vitrage/datasources/prometheus
Prometheus alerting rules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Prometheus libvirt exporter: https://github.com/CanonicalLtd/prometheus-openstack-exporter
Prometheus node exporter: https://github.com/prometheus/node_exporter
