Prometheus Datasource Labels Mapping
StoryBoard link (task #28682): https://storyboard.openstack.org/#!/story/2004682
This blueprint describes the method of mapping Prometheus's alert labels into Vitrage, in a way that will allow Vitrage to identify the resource that the alarm was raised on.
Problem description
Vitrage holds an entity graph with resources and alarms. Each alarm
should be connected to its resource in the graph, in order to support
alarm correlation. In Prometheus, an alert is based on a metric, and
each metric can carry different labels. The labels contain enough
information to identify the resource, but the label or labels that
Vitrage needs to use may differ from alert to alert.
The purpose of this blueprint is to define a way for Vitrage to easily determine the identification method for every Prometheus alert.
In the use cases that are described below, we would like to create a
Prometheus alarm vertex in Vitrage and connect it with an "on" edge to
the right resource in the graph.
Prometheus alert structure
A Prometheus alert contains several fields, including:
- annotations: annotations including the title and description of the alert.
- labels: a list of one or more labels. The labels are generated from the alert rule and from the metrics that the alert is based on.
- status: the current status of the alert, either firing or resolved.
An example of a Prometheus alert:
{
"annotations": {
"description": "The average amount of CPU time spent in idle mode, per second, over the last minute (in seconds)",
"title": "High average CPU time on idle mode"
},
"endsAt": "2018-12-30T09:43:52.589431274Z",
"generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22node%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29%29+%3E+20&g0.tab=1",
"labels": {
"alertname": "AvgCPUTimeOnIdleMode",
"instance": "135.248.18.109:9100",
"severity": "warning"
},
"receivers": [
"vitrage"
],
"startsAt": "2018-12-26T15:22:07.589431274Z",
"status": "firing"
}
A full description of the Prometheus alert structure can be found in prometheus_alert_description.
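As an illustration of the fields listed above, the following minimal Python sketch parses an alert payload with the standard json module and reads its labels and status. The helper name summarize_alert is hypothetical and is not part of Vitrage; the payload is a trimmed copy of the example alert.

```python
import json

# Trimmed copy of the example alert shown in this section.
ALERT_JSON = """
{
  "annotations": {"title": "High average CPU time on idle mode"},
  "labels": {
    "alertname": "AvgCPUTimeOnIdleMode",
    "instance": "135.248.18.109:9100",
    "severity": "warning"
  },
  "status": "firing"
}
"""


def summarize_alert(raw):
    """Return (alert name, labels, is_firing) for one alert payload.

    Hypothetical helper for illustration only; not part of Vitrage.
    """
    alert = json.loads(raw)
    labels = alert.get("labels", {})
    is_firing = alert.get("status") == "firing"
    return labels.get("alertname"), labels, is_firing


name, labels, is_firing = summarize_alert(ALERT_JSON)
print(name, labels["instance"], is_firing)
```

Only the labels carry enough information to locate the resource, which is why the rest of this blueprint concentrates on them.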
Alerts based on libvirt metrics
All libvirt alerts have the following labels:
- instance: holds the hostname/IP of the host where the exporter is running
- domain: the libvirt domain name that the exporter is scraping
Alert based on libvirt CPU metrics
Prometheus alert example
{
"annotations": {
"description": "Test alert to test libvirt exporter.\n",
"title": "High cpu usage on vm"
},
"endsAt": "2018-12-30T09:44:05.91446215Z",
"generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=rate%28libvirt_domain_info_cpu_time_seconds_total%5B1m%5D%29+%2A+10000+%3E+13&g0.tab=1",
"labels": {
"alertname": "HighCpuOnVmAlert",
"domain": "instance-00000004",
"instance": "135.248.18.109:9177",
"job": "libvirt",
"severity": "critical"
},
"receivers": [
"vitrage"
],
"startsAt": "2018-12-26T15:23:05.91446215Z",
"status": "firing"
}
Vitrage resource
The Vitrage resource can be uniquely identified by the instance and domain labels.
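A rough sketch of how these two labels could be combined into a lookup key is shown below. It assumes that the instance label has the form host:port and that only the host part is needed to find the resource. That assumption and the helper name libvirt_resource_key are illustrative and are not taken from Vitrage.

```python
def libvirt_resource_key(labels):
    """Build a (host, domain) key from the labels of a libvirt alert.

    Hypothetical helper. Assumes "instance" looks like "host:port";
    a value without a port is returned unchanged as the host.
    """
    instance = labels["instance"]
    # Split on the last ":" so that only the trailing port is removed.
    host, sep, _port = instance.rpartition(":")
    if not sep:
        host = instance
    return host, labels["domain"]


# Labels copied from the libvirt CPU alert example.
labels = {
    "alertname": "HighCpuOnVmAlert",
    "domain": "instance-00000004",
    "instance": "135.248.18.109:9177",
    "job": "libvirt",
    "severity": "critical",
}
key = libvirt_resource_key(labels)
print(key)
```

Whether the port should be dropped depends on how hosts are stored in the entity graph, so this step is best treated as a deployment-specific choice.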
Alert based on libvirt network metrics
Prometheus alert example
{
"annotations": {
"description": "Another test alert to test libvirt exporter.\n",
"title": "High traffic on bridge"
},
"endsAt": "2018-12-30T09:43:50.91446215Z",
"generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=rate%28libvirt_domain_interface_stats_receive_bytes_total%5B5m%5D%29+%3E+0&g0.tab=1",
"labels": {
"alertname": "HighTrafficOnBridge",
"domain": "instance-00000004",
"instance": "135.248.18.109:9177",
"job": "libvirt",
"severity": "critical",
"source_bridge": "br-int",
"target_device": "tap456ab233-f4"
},
"receivers": [
"vitrage"
],
"startsAt": "2018-12-26T15:22:05.91446215Z",
"status": "firing"
}
Vitrage resource
- Short term: raise the alarm on the node or instance. The Vitrage resource can be uniquely identified by the instance and domain labels.
- Long term: Vitrage should hold a resource for br-int and the alarm should be connected to that resource. The Vitrage resource can be uniquely identified by the instance, domain, source_bridge and target_device labels.
Node metrics
Prometheus alert
All Node metrics have an instance label that holds the address of the
exporter. The exporter can scrape metrics from the instance it is
running on. In this case the instance label represents the resource
address. It can also scrape metrics that do not come from the instance
(e.g. network metrics). In this case instance is just the address of
the exporter, and other labels indicate the resource.
Alert based on node CPU metric
Prometheus alert
CPU metrics are scraped from the instance, so the instance label represents the resource address.
Prometheus alert example
{
"annotations": {
"description": "The average amount of CPU time spent in idle mode, per second, over the last minute (in seconds)",
"title": "High average CPU time on idle mode"
},
"endsAt": "2018-12-30T09:43:52.589431274Z",
"generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28instance%29+%28irate%28node_cpu_seconds_total%7Bjob%3D%22node%22%2Cmode%3D%22idle%22%7D%5B5m%5D%29%29%29+%3E+20&g0.tab=1",
"labels": {
"alertname": "AvgCPUTimeOnIdleMode",
"instance": "135.248.18.109:9100",
"severity": "warning"
},
"receivers": [
"vitrage"
],
"startsAt": "2018-12-26T15:22:07.589431274Z",
"status": "firing"
}
Vitrage resource
The Vitrage resource can be uniquely identified by the instance label.
Alert based on node filesystem metric
Prometheus alert example
{
"annotations": {
"description": "\"Consider ssh'ing into the instance and removing files or clean\ntemp files\"\n",
"device": "/dev/vda1",
"mount_point": "/",
"runbook": "troubleshooting/filesystem_alerts_inodes.md",
"title": "High number of inode usage",
"value": "92.42%"
},
"endsAt": "2018-12-30T09:43:52.589431274Z",
"generatorURL": "http://devstack-rocky-release-4:9090/graph?g0.expr=node_filesystem_files_free%7Bfstype%3D~%22%28ext.%7Cxfs%29%22%2Cjob%3D%22node%22%7D+%2F+node_filesystem_files%7Bfstype%3D~%22%28ext.%7Cxfs%29%22%2Cjob%3D%22node%22%7D+%2A+100+%3C%3D+100&g0.tab=1",
"labels": {
"alertname": "HighInodeUsage",
"device": "/dev/vda1",
"fstype": "ext4",
"instance": "135.248.18.109:9100",
"job": "node",
"mountpoint": "/",
"severity": "critical"
},
"receivers": [
"vitrage"
],
"startsAt": "2018-12-26T15:22:07.589431274Z",
"status": "firing"
}
Vitrage resource
- Short term: raise the alarm on the node or instance. The Vitrage resource can be uniquely identified by the instance label.
- Long term: Vitrage should hold a resource for ext4 and the alarm should be connected to that resource. The Vitrage resource can be uniquely identified by the instance, device and fstype labels.
Proposed change
A configuration file that maps the Prometheus labels to a corresponding Vitrage resource with specific properties (id or other unique properties). The mapping will most likely be defined by the alert name and other fields.
Prometheus configuration file structure
The configuration file contains a list of alerts. Each alert contains
key and resource. The key contains labels which uniquely identify each
alert. The resource specifies how to identify in Vitrage the resource
that the alert is on. It contains one or more Vitrage property names
and the corresponding Prometheus alert labels.
Configuration file example
alerts:
  - key:
      alertname: HighCpuOnVmAlert
      job: libvirt
    resource:
      instance_name: domain
      host_id: instance
  - key:
      alertname: HighTrafficOnBridge
      job: libvirt
    resource:
      instance_name: domain
      host_id: instance
  - key:
      alertname: AvgCPUTimeOnIdleMode
    resource:
      id: instance
  - key:
      alertname: HighInodeUsage
      job: node
    resource:
      id: instance
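The lookup that this configuration implies can be sketched in Python as follows. The configuration is written as a Python literal that mirrors the example so that the sketch stays self-contained. The function names, the first-match-wins rule and the handling of missing labels are assumptions made for illustration, not the actual behavior of the Prometheus transformer.

```python
# Python literal mirroring the configuration file example.
CONFIG = {
    "alerts": [
        {"key": {"alertname": "HighCpuOnVmAlert", "job": "libvirt"},
         "resource": {"instance_name": "domain", "host_id": "instance"}},
        {"key": {"alertname": "HighTrafficOnBridge", "job": "libvirt"},
         "resource": {"instance_name": "domain", "host_id": "instance"}},
        {"key": {"alertname": "AvgCPUTimeOnIdleMode"},
         "resource": {"id": "instance"}},
        {"key": {"alertname": "HighInodeUsage", "job": "node"},
         "resource": {"id": "instance"}},
    ]
}


def find_mapping(config, labels):
    """Return the first entry whose key labels all match, else None.

    Hypothetical helper; first match wins by assumption.
    """
    for entry in config["alerts"]:
        if all(labels.get(k) == v for k, v in entry["key"].items()):
            return entry
    return None


def resource_properties(config, labels):
    """Translate alert labels into Vitrage property values.

    Returns None when no entry matches. A label that the mapping
    names but the alert lacks yields None for that property.
    """
    entry = find_mapping(config, labels)
    if entry is None:
        return None
    return {prop: labels.get(label)
            for prop, label in entry["resource"].items()}


# Labels copied from the libvirt CPU alert example.
labels = {
    "alertname": "HighCpuOnVmAlert",
    "domain": "instance-00000004",
    "instance": "135.248.18.109:9177",
    "job": "libvirt",
    "severity": "critical",
}
props = resource_properties(CONFIG, labels)
print(props)
```

In this sketch the key is matched on every label it lists, so an entry with only alertname (such as AvgCPUTimeOnIdleMode) matches regardless of the job label.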
Alternatives
None
Data model impact
None
REST API impact
None
Versioning impact
None
Other end user impact
None
Deployer impact
TBD
Developer impact
None
Horizon impact
None
Implementation
Assignee(s)
- Primary assignee: 7mode3294 (Muhamad Najjar)
Work Items
- Load configuration file and use it in the Prometheus transformer.
- Documentation and tests.
Dependencies
None
Testing
Unit tests, functional tests and Tempest tests.
Documentation Impact
The new configuration file will be documented.
References
- Prometheus datasource: https://github.com/openstack/vitrage/tree/master/vitrage/datasources/prometheus
- Prometheus alerting rules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
- Prometheus libvirt exporter: https://github.com/CanonicalLtd/prometheus-openstack-exporter
- Prometheus node exporter: https://github.com/prometheus/node_exporter