Foundation for LMA docs
This begins building documentation for the LMA services included in
openstack-helm-infra. This includes documentation for: kibana,
elasticsearch, fluent-logging, grafana, prometheus, and nagios.

Change-Id: Iaa24be04748e76fabca998972398802e7e921ef1
Signed-off-by: Steve Wilkerson <wilkers.steve@gmail.com>
This commit is contained in:
parent 1c87af7856
commit eab9ca05a6
doc/source/index.rst
@@ -8,7 +8,9 @@ Contents:
    install/index
    testing/index
+   monitoring/index
+   logging/index
    readme

 Indices and Tables
 ==================
doc/source/logging/elasticsearch.rst (new file, 196 lines)
@@ -0,0 +1,196 @@
Elasticsearch
=============

The Elasticsearch chart in openstack-helm-infra provides a distributed data
store to index and analyze logs generated from the OpenStack-Helm services.
The chart contains templates for:

- Elasticsearch client nodes
- Elasticsearch data nodes
- Elasticsearch master nodes
- An Elasticsearch exporter for providing cluster metrics to Prometheus
- A cronjob for Elastic Curator to manage data indices

Authentication
--------------

The Elasticsearch deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Elasticsearch. The
username and password are configured under the Elasticsearch entry in the
endpoints section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
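As a minimal sketch, the default credentials could be overridden with values
similar to the following; the username and password shown are placeholders, and
the exact layout assumes the chart follows the common openstack-helm endpoints
auth structure:

::

  endpoints:
    elasticsearch:
      auth:
        admin:
          # placeholder credentials for illustration only
          username: admin
          password: changeme
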
Elasticsearch Service Configuration
-----------------------------------

The Elasticsearch service configuration file can be modified with a combination
of pod environment variables and entries in the values.yaml file. Elasticsearch
does not require much configuration out of the box, and the default values for
these configuration settings are meant to provide a highly available cluster by
default.

The vital entries in this configuration file are:

- path.data: The path at which to store the indexed data
- path.repo: The location of any snapshot repositories used to back up indexes
- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk
- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster

The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk
during execution, and setting this value to false will negatively affect the
health of your Elasticsearch nodes. The discovery.zen.minimum_master_nodes flag
registers the minimum number of masters required for your Elasticsearch cluster
to register as healthy and functional.

To read more about Elasticsearch's configuration file, please see the official
documentation_.

.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html
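As an illustration only, a rendered configuration file containing these vital
entries might look like the following; the paths and node count shown here are
placeholders rather than the chart's defaults:

::

  path:
    data: /usr/share/elasticsearch/data
    repo: /var/lib/elasticsearch/backups
  bootstrap:
    memory_lock: true
  discovery:
    zen:
      minimum_master_nodes: 2
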
Elastic Curator
---------------

The Elasticsearch chart contains a cronjob to run Elastic Curator at specified
intervals to manage the lifecycle of your indices. Curator can:

- Take and send a snapshot of your indexes to a specified snapshot repository
- Delete indexes older than a specified length of time
- Restore indexes from previous index snapshots
- Reindex an index into a new or preexisting index

The full list of supported Curator actions can be found in the actions_ section of
the official Curator documentation. The list of options available for those
actions can be found in the options_ section of the Curator documentation.

.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html
.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html

Curator's configuration is handled via entries in Elasticsearch's values.yaml
file and must be overridden to achieve your index lifecycle management
needs. Please note that any unused field should be left blank; an entry of
"None" will result in an exception, as Curator will read it as the string
"None" instead of a Python None value.

The section for Curator's service configuration can be found at:

::

  conf:
    curator:
      config:
        client:
          hosts:
            - elasticsearch-logging
          port: 9200
          url_prefix:
          use_ssl: False
          certificate:
          client_cert:
          client_key:
          ssl_no_validate: False
          http_auth:
          timeout: 30
          master_only: False
        logging:
          loglevel: INFO
          logfile:
          logformat: default
          blacklist: ['elasticsearch', 'urllib3']

Curator's actions are configured in the following section:

::

  conf:
    curator:
      action_file:
        actions:
          1:
            action: delete_indices
            description: "Clean up ES by deleting old indices"
            options:
              timeout_override:
              continue_if_exception: False
              ignore_empty_list: True
              disable_action: True
            filters:
              - filtertype: age
                source: name
                direction: older
                timestring: '%Y.%m.%d'
                unit: days
                unit_count: 30
                field:
                stats_result:
                epoch:
                exclude: False

The Elasticsearch chart contains commented example actions for deleting and
snapshotting indexes older than 30 days. Please note these actions are provided
as a reference and are disabled by default to avoid any unexpected behavior
against your indexes.

Elasticsearch Exporter
----------------------

The Elasticsearch chart contains templates for an exporter to provide metrics
for Prometheus. These metrics provide insight into the performance and overall
health of your Elasticsearch cluster. Please note monitoring for Elasticsearch
is disabled by default, and must be enabled with the following override:

::

  monitoring:
    prometheus:
      enabled: true

The Elasticsearch exporter uses the same service annotations as the other
exporters, and no additional configuration is required for Prometheus to target
the Elasticsearch exporter for scraping. The Elasticsearch exporter is
configured with command line flags, and the flags' default values can be found
under the following key in the values.yaml file:

::

  conf:
    prometheus_elasticsearch_exporter:
      es:
        all: true
        timeout: 20s

The configuration keys configure the following behaviors:

- es.all: Gather information from all nodes, not just the connecting node
- es.timeout: Timeout for metrics queries

More information about the Elasticsearch exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter

Snapshot Repositories
---------------------

Before Curator can store snapshots in a specified repository, Elasticsearch must
register the configured repository. To achieve this, the Elasticsearch chart
contains a job for registering an s3 snapshot repository backed by radosgateway.
This job is disabled by default as the Curator actions for snapshots are
disabled by default. To enable the snapshot job, the
conf.elasticsearch.snapshots.enabled flag must be set to true. The following
configuration keys are relevant:

- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories
- conf.elasticsearch.snapshots.bucket: Name of the RGW s3 bucket to use
- conf.elasticsearch.snapshots.repositories: Name of repositories to create
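As a sketch, enabling the snapshot job with an illustrative bucket and
repository name could look like the following; the names used here are
placeholders, and the exact structure of the repositories entry is an
assumption that should be checked against the chart's values.yaml:

::

  conf:
    elasticsearch:
      snapshots:
        enabled: true
        # placeholder bucket and repository names for illustration only
        bucket: elasticsearch-snapshots
        repositories:
          - logstash_snapshots
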
More information about Elasticsearch repositories can be found in the official
Elasticsearch snapshot_ documentation.

.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories
doc/source/logging/fluent-logging.rst (new file, 279 lines)
@@ -0,0 +1,279 @@
Fluent-logging
==============

The fluent-logging chart in openstack-helm-infra provides the base for a
centralized logging platform for OpenStack-Helm. The chart combines two
services, Fluentbit and Fluentd, to gather logs generated by the services,
filter on or add metadata to logged events, then forward them to Elasticsearch
for indexing.

Fluentbit
---------

Fluentbit runs as a log-collecting component on each host in the cluster, and
can be configured to target specific log locations on the host. The Fluentbit_
configuration schema can be found on the official Fluentbit website.

.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html

Fluentbit provides a set of plug-ins for ingesting and filtering various log
types. These plug-ins include:

- Tail: Tails a defined file for logged events
- Kube: Adds Kubernetes metadata to a logged event
- Systemd: Provides the ability to collect logs from the journald daemon
- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP)

The complete list of plugins can be found in the configuration_ section of the
Fluentbit documentation.

.. _configuration: http://fluentbit.io/documentation/current/configuration/

Fluentbit uses parsers to turn unstructured log entries into structured entries
to make processing and filtering events easier. The two formats supported are
JSON maps and regular expressions. More information about Fluentbit's parsing
abilities can be found in the parsers_ section of Fluentbit's documentation.

.. _parsers: http://fluentbit.io/documentation/current/parser/

Fluentbit's service and parser configurations are defined via the values.yaml
file, which allows for custom definitions of inputs, filters and outputs for
your logging needs. Fluentbit's configuration can be found under the following
key:

::

  conf:
    fluentbit:
      - service:
          header: service
          Flush: 1
          Daemon: Off
          Log_Level: info
          Parsers_File: parsers.conf
      - containers_tail:
          header: input
          Name: tail
          Tag: kube.*
          Path: /var/log/containers/*.log
          Parser: docker
          DB: /var/log/flb_kube.db
          Mem_Buf_Limit: 5MB
      - kube_filter:
          header: filter
          Name: kubernetes
          Match: kube.*
          Merge_JSON_Log: On
      - fluentd_output:
          header: output
          Name: forward
          Match: "*"
          Host: ${FLUENTD_HOST}
          Port: ${FLUENTD_PORT}

Fluentbit is configured by default to capture logs at the info log level. To
change this, override the Log_Level key with the appropriate level, which are
documented in Fluentbit's configuration_.
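For example, to capture only errors, the Log_Level entry in the service section
could be overridden as sketched below. Note that, because the fluentbit key
holds a list, an override replaces the entire list, so the input, filter, and
output items shown above would need to be included in a real override as well:

::

  conf:
    fluentbit:
      - service:
          header: service
          Flush: 1
          Daemon: Off
          # 'error' is illustrative; any level supported by Fluentbit may be used
          Log_Level: error
          Parsers_File: parsers.conf
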
Fluentbit's parser configuration can be found under the following key:

::

  conf:
    parsers:
      - docker:
          header: parser
          Name: docker
          Format: json
          Time_Key: time
          Time_Format: "%Y-%m-%dT%H:%M:%S.%L"
          Time_Keep: On

The values for the fluentbit and parsers keys are consumed by a fluent-logging
helper template that produces the appropriate configurations for the relevant
sections. Each list item (keys prefixed with a '-') represents a section in the
configuration files, and the arbitrary name of the list item should represent a
logical description of the section defined. The header key represents the type
of definition (filter, input, output, service or parser), and the remaining
entries will be rendered as space delimited configuration keys and values. For
example, the definitions above would result in the following:

::

  [SERVICE]
      Daemon false
      Flush 1
      Log_Level info
      Parsers_File parsers.conf
  [INPUT]
      DB /var/log/flb_kube.db
      Mem_Buf_Limit 5MB
      Name tail
      Parser docker
      Path /var/log/containers/*.log
      Tag kube.*
  [FILTER]
      Match kube.*
      Merge_JSON_Log true
      Name kubernetes
  [OUTPUT]
      Host ${FLUENTD_HOST}
      Match *
      Name forward
      Port ${FLUENTD_PORT}
  [PARSER]
      Format json
      Name docker
      Time_Format %Y-%m-%dT%H:%M:%S.%L
      Time_Keep true
      Time_Key time

Fluentd
-------

Fluentd runs as a forwarding service that receives event entries from Fluentbit
and routes them to the appropriate destination. By default, Fluentd will route
all entries received from Fluentbit to Elasticsearch for indexing. The
Fluentd_ configuration schema can be found at the official Fluentd website.

.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file

Fluentd's configuration is handled in the values.yaml file in fluent-logging.
Similar to Fluentbit, configuration overrides provide flexibility in defining
custom routes for tagged log events. The configuration can be found under the
following key:

::

  conf:
    fluentd:
      - fluentbit_forward:
          header: source
          type: forward
          port: "#{ENV['FLUENTD_PORT']}"
          bind: 0.0.0.0
      - elasticsearch:
          header: match
          type: elasticsearch
          expression: "**"
          include_tag_key: true
          host: "#{ENV['ELASTICSEARCH_HOST']}"
          port: "#{ENV['ELASTICSEARCH_PORT']}"
          logstash_format: true
          buffer_chunk_limit: 10M
          buffer_queue_limit: 32
          flush_interval: "20"
          max_retry_wait: 300
          disable_retry_limit: ""

The values for the fluentd keys are consumed by a fluent-logging helper template
that produces appropriate configurations for each directive desired. The list
items (keys prefixed with a '-') represent sections in the configuration file,
and the name of each list item should represent a logical description of the
section defined. The header key represents the type of definition (name of the
fluentd plug-in used), and the expression key is used when the plug-in requires
a pattern to match against (example: matches on certain input patterns). The
remaining entries will be rendered as space delimited configuration keys and
values. For example, the definition above would result in the following:

::

  <source>
    bind 0.0.0.0
    port "#{ENV['FLUENTD_PORT']}"
    @type forward
  </source>
  <match **>
    buffer_chunk_limit 10M
    buffer_queue_limit 32
    disable_retry_limit
    flush_interval 20s
    host "#{ENV['ELASTICSEARCH_HOST']}"
    include_tag_key true
    logstash_format true
    max_retry_wait 300
    port "#{ENV['ELASTICSEARCH_PORT']}"
    @type elasticsearch
  </match>

Some fluentd plug-ins require nested definitions. The fluentd helper template
can handle these definitions with the following structure:

::

  conf:
    td_agent:
      - fluentbit_forward:
          header: source
          type: forward
          port: "#{ENV['FLUENTD_PORT']}"
          bind: 0.0.0.0
      - log_transformer:
          header: filter
          type: record_transformer
          expression: "foo.bar"
          inner_def:
            - record_transformer:
                header: record
                hostname: my_host
                tag: my_tag

In this example, the inner_def list will generate a nested configuration
entry in the log_transformer section. The nested definitions are handled by
supplying a list as the value for an arbitrary key, and the list value will
indicate the entry should be handled as a nested definition. The helper
template will render the above example key/value pairs as the following:

::

  <source>
    bind 0.0.0.0
    port "#{ENV['FLUENTD_PORT']}"
    @type forward
  </source>
  <filter foo.bar>
    <record>
      hostname my_host
      tag my_tag
    </record>
    @type record_transformer
  </filter>

Fluentd Exporter
----------------

The fluent-logging chart contains templates for an exporter to provide metrics
for Fluentd. These metrics provide insight into Fluentd's performance. Please
note monitoring for Fluentd is disabled by default, and must be enabled with the
following override:

::

  monitoring:
    prometheus:
      enabled: true

The Fluentd exporter uses the same service annotations as the other exporters,
and no additional configuration is required for Prometheus to target the
Fluentd exporter for scraping. The Fluentd exporter is configured with command
line flags, and the flags' default values can be found under the following key
in the values.yaml file:

::

  conf:
    fluentd_exporter:
      log:
        format: "logger:stdout?json=true"
        level: "info"

The configuration keys configure the following behaviors:

- log.format: Define the logger used and format of the output
- log.level: Log level for the exporter to use

More information about the Fluentd exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter
doc/source/logging/index.rst (new file, 11 lines)
@@ -0,0 +1,11 @@

OpenStack-Helm Logging
======================

Contents:

.. toctree::
   :maxdepth: 2

   elasticsearch
   fluent-logging
   kibana
doc/source/logging/kibana.rst (new file, 76 lines)
@@ -0,0 +1,76 @@
Kibana
======

The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed
into Elasticsearch. These visualizations provide the means to view logs captured
from services deployed in the cluster and targeted for collection by Fluentbit.

Authentication
--------------

The Kibana deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Kibana. The username and password
are configured under the Kibana entry in the endpoints section of the chart's
values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Configuration
-------------

Kibana's configuration is driven by the chart's values.yaml file. The
configuration options are found under the following keys:

::

  conf:
    elasticsearch:
      pingTimeout: 1500
      preserveHost: true
      requestTimeout: 30000
      shardTimeout: 0
      startupTimeout: 5000
    il8n:
      defaultLocale: en
    kibana:
      defaultAppId: discover
      index: .kibana
    logging:
      quiet: false
      silent: false
      verbose: false
    ops:
      interval: 5000
    server:
      host: localhost
      maxPayloadBytes: 1048576
      port: 5601
      ssl:
        enabled: false

The case of the sub-keys is important as these values are injected into
Kibana's configuration configmap with the toYaml function. More information on
the configuration options and available settings can be found in the official
Kibana documentation_.

.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html

Installation
------------

.. code-block:: bash

  helm install --namespace=<namespace> local/kibana --name=kibana

Setting Time Field
------------------

For Kibana to successfully read the logs from Elasticsearch's indexes, the time
field will need to be manually set after Kibana has successfully deployed. Upon
visiting the Kibana dashboard for the first time, a prompt will appear to choose
the time field with a drop-down menu. The default time field for Elasticsearch
indexes is '@timestamp'. Once this field is selected, the default view for
querying log entries can be found by selecting the "Discover" tab.
doc/source/monitoring/grafana.rst (new file, 89 lines)
@@ -0,0 +1,89 @@
Grafana
=======

The Grafana chart in OpenStack-Helm Infra provides default dashboards for the
metrics gathered with Prometheus. The default dashboards include visualizations
for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and
OpenStack.

Configuration
-------------

Grafana
~~~~~~~

Grafana's configuration is driven with the chart's values.yaml file, and the
relevant configuration entries are under the following key:

::

  conf:
    grafana:
      paths:
      server:
      database:
      session:
      security:
      users:
      log:
      log.console:
      dashboards.json:
      grafana_net:

These keys correspond to sections in the grafana.ini configuration file, and the
to_ini helm-toolkit function will render these values into the appropriate
format in grafana.ini. The list of options for these keys can be found in the
official Grafana configuration_ documentation.

.. _configuration: http://docs.grafana.org/installation/configuration/

Prometheus Data Source
~~~~~~~~~~~~~~~~~~~~~~

Grafana requires configured data sources for gathering metrics for display in
its dashboards. The configuration options for data sources are found under the
following key in Grafana's values.yaml file:

::

  conf:
    provisioning:
      datasources:
        monitoring:
          name: prometheus
          type: prometheus
          access: proxy
          orgId: 1
          editable: true
          basicAuth: true

The Grafana chart will use the keys under each entry beneath
conf.provisioning.datasources as inputs to a helper template that will render
the appropriate configuration for the data source. The key for each data source
(monitoring in the above example) should map to an entry in the endpoints
section in the chart's values.yaml, as the data source's URL and authentication
credentials will be populated by the values defined for the corresponding
endpoint. More information on the supported data source types can be found in
the official Grafana sources_ documentation.

.. _sources: http://docs.grafana.org/features/datasources/

Dashboards
~~~~~~~~~~

Grafana adds dashboards during installation with dashboards defined in YAML under
the following key:

::

  conf:
    dashboards:

These YAML definitions are transformed to JSON and added to Grafana's
configuration configmap and mounted to the Grafana pods dynamically, allowing for
flexibility in defining and adding custom dashboards to Grafana. Dashboards can
be added by inserting a new key along with a YAML dashboard definition as the
value. Additional dashboards can be found by searching on Grafana's dashboards_
page or you can define your own. A JSON-to-YAML tool, such as json2yaml_, will
help transform any custom or new dashboards from JSON to YAML.

.. _json2yaml: https://www.json2yaml.com/
.. _dashboards: https://grafana.com/dashboards
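As a sketch, a custom dashboard could be added under its own key as shown below;
the dashboard name and fields are illustrative placeholders rather than chart
defaults, and a real definition would carry the full dashboard converted from
JSON:

::

  conf:
    dashboards:
      my_custom_dashboard:
        # illustrative skeleton of a dashboard definition expressed in YAML
        title: My Custom Dashboard
        timezone: browser
        rows:
          - title: Example Row
            panels: []
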
doc/source/monitoring/index.rst (new file, 11 lines)
@@ -0,0 +1,11 @@

OpenStack-Helm Monitoring
=========================

Contents:

.. toctree::
   :maxdepth: 2

   grafana
   prometheus
   nagios
doc/source/monitoring/nagios.rst (new file, 365 lines)
@@ -0,0 +1,365 @@
Nagios
======

The Nagios chart in openstack-helm-infra can be used to provide an alerting
service that's tightly coupled to an OpenStack-Helm deployment. The Nagios
chart uses a custom Nagios core image that includes plugins developed to query
Prometheus directly for scraped metrics and triggered alarms, to query the Ceph
manager endpoints directly to determine the health of a Ceph cluster, and to
query Elasticsearch for logged events that meet certain criteria (experimental).

Authentication
--------------

The Nagios deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Nagios. The username and password
are configured under the nagios entry in the endpoints section of the chart's
values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Image Plugins
-------------

The Nagios image used contains custom plugins that can be used for the defined
service check commands. These plugins include:

- check_prometheus_metric.py: Query Prometheus for a specific metric and value
- check_exporter_health_metric.sh: Nagios plugin to query a Prometheus exporter
- check_rest_get_api.py: Check REST API status
- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
- query_prometheus_alerts.py: Nagios plugin to query the Prometheus ALERTS metric

More information about the Nagios image and plugins can be found here_.

.. _here: https://github.com/att-comdev/nagios

Nagios Service Configuration
----------------------------

The Nagios service is configured via the following section in the chart's
values file:

::

  conf:
    nagios:
      nagios:
        log_file: /opt/nagios/var/log/nagios.log
        cfg_file:
          - /opt/nagios/etc/nagios_objects.cfg
          - /opt/nagios/etc/objects/commands.cfg
          - /opt/nagios/etc/objects/contacts.cfg
          - /opt/nagios/etc/objects/timeperiods.cfg
          - /opt/nagios/etc/objects/templates.cfg
          - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
        object_cache_file: /opt/nagios/var/objects.cache
        precached_object_file: /opt/nagios/var/objects.precache
        resource_file: /opt/nagios/etc/resource.cfg
        status_file: /opt/nagios/var/status.dat
        status_update_interval: 10
        nagios_user: nagios
        nagios_group: nagios
        check_external_commands: 1
        command_file: /opt/nagios/var/rw/nagios.cmd
        lock_file: /var/run/nagios.lock
        temp_file: /opt/nagios/var/nagios.tmp
        temp_path: /tmp
        event_broker_options: -1
        log_rotation_method: d
        log_archive_path: /opt/nagios/var/log/archives
        use_syslog: 1
        log_service_retries: 1
        log_host_retries: 1
        log_event_handlers: 1
        log_initial_states: 0
        log_current_states: 1
        log_external_commands: 1
        log_passive_checks: 1
        service_inter_check_delay_method: s
        max_service_check_spread: 30
        service_interleave_factor: s
        host_inter_check_delay_method: s
        max_host_check_spread: 30
        max_concurrent_checks: 60
        check_result_reaper_frequency: 10
        max_check_result_reaper_time: 30
        check_result_path: /opt/nagios/var/spool/checkresults
        max_check_result_file_age: 3600
        cached_host_check_horizon: 15
        cached_service_check_horizon: 15
        enable_predictive_host_dependency_checks: 1
        enable_predictive_service_dependency_checks: 1
        soft_state_dependencies: 0
        auto_reschedule_checks: 0
        auto_rescheduling_interval: 30
        auto_rescheduling_window: 180
        service_check_timeout: 60
        host_check_timeout: 60
        event_handler_timeout: 60
        notification_timeout: 60
        ocsp_timeout: 5
        perfdata_timeout: 5
        retain_state_information: 1
        state_retention_file: /opt/nagios/var/retention.dat
        retention_update_interval: 60
        use_retained_program_state: 1
        use_retained_scheduling_info: 1
        retained_host_attribute_mask: 0
        retained_service_attribute_mask: 0
        retained_process_host_attribute_mask: 0
        retained_process_service_attribute_mask: 0
        retained_contact_host_attribute_mask: 0
        retained_contact_service_attribute_mask: 0
        interval_length: 1
        check_workers: 4
        check_for_updates: 1
        bare_update_check: 0
        use_aggressive_host_checking: 0
        execute_service_checks: 1
        accept_passive_service_checks: 1
        execute_host_checks: 1
        accept_passive_host_checks: 1
        enable_notifications: 1
        enable_event_handlers: 1
        process_performance_data: 0
        obsess_over_services: 0
        obsess_over_hosts: 0
        translate_passive_host_checks: 0
        passive_host_checks_are_soft: 0
        check_for_orphaned_services: 1
        check_for_orphaned_hosts: 1
        check_service_freshness: 1
        service_freshness_check_interval: 60
        check_host_freshness: 0
        host_freshness_check_interval: 60
        additional_freshness_latency: 15
        enable_flap_detection: 1
        low_service_flap_threshold: 5.0
        high_service_flap_threshold: 20.0
        low_host_flap_threshold: 5.0
        high_host_flap_threshold: 20.0
        date_format: us
        use_regexp_matching: 1
        use_true_regexp_matching: 0
        daemon_dumps_core: 0
        use_large_installation_tweaks: 0
        enable_environment_macros: 0
        debug_level: 0
        debug_verbosity: 1
        debug_file: /opt/nagios/var/nagios.debug
        max_debug_file_size: 1000000
        allow_empty_hostgroup_assignment: 1
        illegal_macro_output_chars: "`~$&|'<>\""

Nagios CGI Configuration
------------------------

The Nagios CGI configuration is defined via the following section in the chart's
values file:

::

  conf:
    nagios:
      cgi:
        main_config_file: /opt/nagios/etc/nagios.cfg
        physical_html_path: /opt/nagios/share
        url_html_path: /nagios
        show_context_help: 0
        use_pending_states: 1
        use_authentication: 0
        use_ssl_authentication: 0
        authorized_for_system_information: "*"
        authorized_for_configuration_information: "*"
        authorized_for_system_commands: nagiosadmin
        authorized_for_all_services: "*"
        authorized_for_all_hosts: "*"
        authorized_for_all_service_commands: "*"
        authorized_for_all_host_commands: "*"
        default_statuswrl_layout: 4
        ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
        refresh_rate: 90
        result_limit: 100
        escape_html_tags: 1
        action_url_target: _blank
        notes_url_target: _blank
        lock_author_names: 1
        navbar_search_for_addresses: 1
        navbar_search_for_aliases: 1

Nagios Host Configuration
-------------------------

The Nagios chart includes a single host definition for the Prometheus instance
queried for metrics. The host definition can be found under the following
values key:

::

  conf:
    nagios:
      hosts:
        - prometheus:
            use: linux-server
            host_name: prometheus
            alias: "Prometheus Monitoring"
            address: 127.0.0.1
            hostgroups: prometheus-hosts
            check_command: check-prometheus-host-alive

The address for the Prometheus host is defined by the PROMETHEUS_SERVICE
environment variable in the deployment template, which is determined by the
monitoring entry in the Nagios chart's endpoints section. The endpoint is then
available as a macro for Nagios to use in all Prometheus based queries. For
example:

::

  - check_prometheus_host_alive:
      command_name: check-prometheus-host-alive
      command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The $USER2$ macro above corresponds to the Prometheus endpoint defined in the
PROMETHEUS_SERVICE environment variable. All checks that use the
prometheus-hosts hostgroup will map back to the Prometheus host defined by this
endpoint.

Nagios HostGroup Configuration
------------------------------

The Nagios chart includes configuration values for defined host groups under the
following values key:

::

  conf:
    nagios:
      host_groups:
        - prometheus-hosts:
            hostgroup_name: prometheus-hosts
            alias: "Prometheus Virtual Host"
        - base-os:
            hostgroup_name: base-os
            alias: "base-os"

These host groups are used to define which group of hosts should be targeted by
a particular Nagios check. An example of a check that targets Prometheus for a
specific metric query would be:

::

  - check_ceph_monitor_quorum:
      use: notifying_service
      hostgroup_name: prometheus-hosts
      service_description: "CEPH_quorum"
      check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
      check_interval: 60

An example of a check that targets all hosts for a base-os type check (memory
usage, latency, etc.) would be:

::

  - check_memory_usage:
      use: notifying_service
      service_description: Memory_usage
      check_command: check_memory_usage
      hostgroup_name: base-os

These two host groups allow for a wide range of targeted checks for determining
the status of all components of an OpenStack-Helm deployment.

Nagios Command Configuration
----------------------------

The Nagios chart includes configuration values for the command definitions Nagios
will use when executing service checks. These values are found under the
following key:

::

  conf:
    nagios:
      commands:
        - send_service_snmp_trap:
            command_name: send_service_snmp_trap
            command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
        - send_host_snmp_trap:
            command_name: send_host_snmp_trap
            command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
        - send_service_http_post:
            command_name: send_service_http_post
            command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
        - send_host_http_post:
            command_name: send_host_http_post
            command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
        - check_prometheus_host_alive:
            command_name: check-prometheus-host-alive
            command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The list of defined commands can be modified with configuration overrides, which
allows for the ability to define commands specific to an infrastructure deployment.
These commands can include querying Prometheus for metrics on dependencies for a
service to determine whether an alert should be raised, executing checks on each
host to determine network latency or file system usage, or checking each node
for issues with NTP clock skew.

Note: Since the conf.nagios.commands key contains a list of the defined commands,
the entire contents of conf.nagios.commands will need to be overridden if
additional commands are desired (due to the immutable nature of lists).

Nagios Service Check Configuration
----------------------------------

The Nagios chart includes configuration values for the service checks Nagios
will execute. These service check commands can be found under the following
key:

::

  conf:
    nagios:
      services:
        - notifying_service:
            name: notifying_service
            use: generic-service
            flap_detection_enabled: 0
            process_perf_data: 0
            contact_groups: snmp_and_http_notifying_contact_group
            check_interval: 60
            notification_interval: 120
            retry_interval: 30
            register: 0
        - check_ceph_health:
            use: notifying_service
            hostgroup_name: base-os
            service_description: "CEPH_health"
            check_command: check_ceph_health
            check_interval: 300
        - check_hosts_health:
            use: generic-service
            hostgroup_name: prometheus-hosts
            service_description: "Nodes_health"
            check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready.
            check_interval: 60
        - check_prometheus_replicas:
            use: notifying_service
            hostgroup_name: prometheus-hosts
            service_description: "Prometheus_replica-count"
            check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas
            check_interval: 60

The Nagios service configurations define the checks Nagios will perform. These
checks contain keys for defining: the service type to use, the host group to
target, the description of the service check, the command the check should use,
and the interval at which to trigger the service check. These services can also
be extended to provide additional insight into the overall status of a
particular service, and allow for advanced checks that determine the overall
health and liveness of a service. For example, a service check could trigger an
alarm for the OpenStack services when Nagios detects that the relevant database
and message queue have become unresponsive.
doc/source/monitoring/prometheus.rst (new file, 338 lines)
@@ -0,0 +1,338 @@
Prometheus
==========

The Prometheus chart in openstack-helm-infra provides a time series database and
a strong querying language for monitoring various components of OpenStack-Helm.
Prometheus gathers metrics by scraping defined service endpoints or pods at
specified intervals and indexing them in the underlying time series database.

Authentication
--------------

The Prometheus deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Prometheus. The
username and password are configured under the monitoring entry in the endpoints
section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Prometheus Service Configuration
--------------------------------

The Prometheus service is configured via command line flags set during runtime.
These flags include: setting the configuration file, setting log levels, setting
characteristics of the time series database, and enabling the web admin API for
snapshot support. These settings can be configured via the values tree at:

::

  conf:
    prometheus:
      command_line_flags:
        log.level: info
        query.max_concurrency: 20
        query.timeout: 2m
        storage.tsdb.path: /var/lib/prometheus/data
        storage.tsdb.retention: 7d
        web.enable_admin_api: false
        web.enable_lifecycle: false

The Prometheus configuration file contains the definitions for scrape targets
and the location of the rules files for triggering alerts on scraped metrics.
The configuration file is defined in the values file, and can be found at:

::

  conf:
    prometheus:
      scrape_configs: |

By defining the configuration via the values file, an operator can override all
configuration components of the Prometheus deployment at runtime.
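As an illustration, a minimal override of this key could define a single static
scrape job as shown below; the scrape interval, job name, and target are
placeholders rather than the chart's defaults:

::

  conf:
    prometheus:
      scrape_configs: |
        global:
          scrape_interval: 60s
        scrape_configs:
          - job_name: node
            static_configs:
              - targets:
                  - localhost:9100
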
Kubernetes Endpoint Configuration
---------------------------------

The Prometheus chart in openstack-helm-infra uses the built-in service discovery
mechanisms for Kubernetes endpoints and pods to automatically configure scrape
targets. Functions added to helm-toolkit allow configuration of these targets
via annotations that can be applied to any service or pod that exposes metrics
for Prometheus, whether a service for an application-specific exporter or an
application that provides a metrics endpoint via its service. The values in
these functions correspond to entries in the monitoring tree under the
prometheus key in a chart's values.yaml file.

The function definitions are below:

::

  {{- define "helm-toolkit.snippets.prometheus_service_annotations" -}}
  {{- $config := index . 0 -}}
  {{- if $config.scrape }}
  prometheus.io/scrape: {{ $config.scrape | quote }}
  {{- end }}
  {{- if $config.scheme }}
  prometheus.io/scheme: {{ $config.scheme | quote }}
  {{- end }}
  {{- if $config.path }}
  prometheus.io/path: {{ $config.path | quote }}
  {{- end }}
  {{- if $config.port }}
  prometheus.io/port: {{ $config.port | quote }}
  {{- end }}
  {{- end -}}

::

  {{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}}
  {{- $config := index . 0 -}}
  {{- if $config.scrape }}
  prometheus.io/scrape: {{ $config.scrape | quote }}
  {{- end }}
  {{- if $config.path }}
  prometheus.io/path: {{ $config.path | quote }}
  {{- end }}
  {{- if $config.port }}
  prometheus.io/port: {{ $config.port | quote }}
  {{- end }}
  {{- end -}}

These functions render the following annotations:

- prometheus.io/scrape: Must be set to true for Prometheus to scrape the target
- prometheus.io/scheme: Overrides the scheme used to scrape the target if not http
- prometheus.io/path: Overrides the path used to scrape the target's metrics if not /metrics
- prometheus.io/port: Overrides the port to scrape metrics on if not the service's default port

Each chart that can be targeted for monitoring by Prometheus has a prometheus
section under a monitoring tree in the chart's values.yaml, and Prometheus
monitoring is disabled by default for those services. Example values for the
required entries can be found in the following monitoring configuration for the
prometheus-node-exporter chart:

::

  monitoring:
    prometheus:
      enabled: false
      node_exporter:
        scrape: true

If the prometheus.enabled key is set to true, the annotations are set on the
targeted service or pod, as the condition for applying the annotations evaluates
to true. For example:

::

  {{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }}
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: {{ tuple "node_metrics" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
    labels:
  {{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
    annotations:
  {{- if .Values.monitoring.prometheus.enabled }}
  {{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }}
  {{- end }}

Kubelet, API Server, and cAdvisor
---------------------------------

The Prometheus chart includes scrape target configurations for the kubelet, the
Kubernetes API servers, and cAdvisor. These targets are configured based on
a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy
Kubernetes in the gates. These configurations may need to change based on your
chosen method of deployment. Please note the cAdvisor metrics will not be
captured if the kubelet was started with the following flag:

::

  --cadvisor-port=0

To enable the gathering of the kubelet's custom metrics, the following flag must
be set:

::

  --enable-custom-metrics

Installation
------------

The Prometheus chart can be installed with the following command:

.. code-block:: bash

  helm install --namespace=openstack local/prometheus --name=prometheus

The above command results in a Prometheus deployment configured to automatically
discover services with the necessary annotations for scraping, and to gather
metrics on the kubelet, the Kubernetes API servers, and cAdvisor.

Extending Prometheus
--------------------

Prometheus can target various exporters to gather metrics related to specific
applications to extend visibility into an OpenStack-Helm deployment. Currently,
openstack-helm-infra contains charts for:

- prometheus-kube-state-metrics: Provides additional Kubernetes metrics
- prometheus-node-exporter: Provides metrics for nodes and Linux kernels
- prometheus-openstack-metrics-exporter: Provides metrics for OpenStack services

Kube-State-Metrics
~~~~~~~~~~~~~~~~~~

The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects
as well as metrics for kube-scheduler and kube-controller-manager. Information
on the specific metrics available via the kube-state-metrics service can be
found in the kube-state-metrics_ documentation.

The prometheus-kube-state-metrics chart can be installed with the following:

.. code-block:: bash

  helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics

.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation

Node Exporter
~~~~~~~~~~~~~

The prometheus-node-exporter chart provides hardware and operating system metrics
exposed via Linux kernels. Information on the specific metrics available via
the Node exporter can be found on the Node_exporter_ GitHub page.

The prometheus-node-exporter chart can be installed with the following:

.. code-block:: bash

  helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter

.. _Node_exporter: https://github.com/prometheus/node_exporter

OpenStack Exporter
~~~~~~~~~~~~~~~~~~

The prometheus-openstack-exporter chart provides metrics specific to the
OpenStack services. The exporter's source code can be found here_. While the
metrics provided are by no means comprehensive, they will be expanded upon.

Please note the OpenStack exporter requires the creation of a Keystone user to
successfully gather metrics. To create the required user, the chart uses the
same Keystone user management job the OpenStack service charts use.

The prometheus-openstack-exporter chart can be installed with the following:

.. code-block:: bash

  helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter

.. _here: https://github.com/att-comdev/openstack-metrics-collector

Other exporters
~~~~~~~~~~~~~~~

Certain charts in OpenStack-Helm include templates for application-specific
Prometheus exporters, which keeps the monitoring of those services tightly coupled
to the chart. The templates for these exporters can be found in the monitoring
subdirectory in the chart. These exporters are disabled by default, and can be
enabled by setting the appropriate flag in the monitoring.prometheus key of the
chart's values.yaml file. The charts containing exporters include:

- Elasticsearch_
- RabbitMQ_
- MariaDB_
- Memcached_
- Fluentd_
- Postgres_

.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter
.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter
.. _MariaDB: https://github.com/prometheus/mysqld_exporter
.. _Memcached: https://github.com/prometheus/memcached_exporter
.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter
.. _Postgres: https://github.com/wrouesnel/postgres_exporter

Ceph
~~~~

Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus
module. This module can be enabled in Ceph's values.yaml under the
ceph_mgr_enabled_plugins key by appending prometheus to the list of enabled
modules. After enabling the prometheus module, metrics can be scraped on the
ceph-mgr service endpoint. This relies on the Prometheus annotations attached to
the ceph-mgr service template, and these annotations can be modified in the
endpoints section of Ceph's values.yaml file. Information on the specific
metrics available via the prometheus module can be found in the Ceph
prometheus_ module documentation.

.. _prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/
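As a sketch, the override might look like the following; the other module names
in the list and the exact position of the key within Ceph's values.yaml are
assumptions and should be checked against that chart:

::

  # append prometheus to whatever modules are already enabled in Ceph's values.yaml
  ceph_mgr_enabled_plugins:
    - status
    - prometheus
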
Prometheus Dashboard
--------------------

Prometheus includes a dashboard that can be accessed via the exposed
Prometheus endpoint (NodePort or otherwise). This dashboard will give you a
view of your scrape targets' state, the configuration values for Prometheus's
scrape jobs and command line flags, a view of any alerts triggered based on the
defined rules, and a means for using PromQL to query scraped metrics. The
Prometheus dashboard is a useful tool for verifying Prometheus is configured
appropriately and for verifying the status of any services targeted for scraping
via the Prometheus service discovery annotations.

Rules Configuration
-------------------

Prometheus provides a querying language that can operate on defined rules which
allow for the generation of alerts on specific metrics. The Prometheus chart in
openstack-helm-infra defines these rules via the values.yaml file. Defining
these in the values file gives operators the flexibility to provide specific
rules via overrides at installation. The following rules keys are provided:

::

  values:
    conf:
      rules:
        alertmanager:
        etcd3:
        kube_apiserver:
        kube_controller_manager:
        kubelet:
        kubernetes:
        rabbitmq:
        mysql:
        ceph:
        openstack:
        custom:

These keys provide recording and alert rules for all infrastructure
components of an OpenStack-Helm deployment. If you wish to exclude rules for a
component, leave the tree empty in an overrides file. To read more
about Prometheus recording and alert rules definitions, please see the official
Prometheus recording_ and alert_ rules documentation.

.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Note: Prometheus releases prior to 2.0 used gotpl to define rules. Prometheus
2.0 changed the rules format to YAML, making them much easier to read. The
Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take
advantage of changes to the underlying storage layer and the handling of stale
data. The chart will not support overrides for Prometheus versions below 2.0,
as the command line flags for the service changed between versions.

The wide range of exporters included in OpenStack-Helm coupled with the ability
to define rules with configuration overrides allows for the addition of custom
alerting and recording rules to fit an operator's monitoring needs. Adding new
rules or modifying existing rules requires overrides for either an existing key
under conf.rules or the addition of a new key under conf.rules. The addition
of custom rules can be used to define complex checks that can be extended for
determining the liveness or health of infrastructure components.
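For instance, a custom alert rule in the Prometheus 2.0 YAML rules format could
be supplied under the custom key as sketched below; the group name, alert name,
expression, and thresholds are illustrative only, and the assumption that each
key under conf.rules holds a complete rules file should be verified against the
chart:

::

  conf:
    rules:
      custom:
        groups:
          - name: custom.rules
            rules:
              # illustrative alert on node_exporter filesystem metrics
              - alert: NodeFilesystemAlmostFull
                expr: (node_filesystem_avail / node_filesystem_size) < 0.1
                for: 10m
                labels:
                  severity: warning
                annotations:
                  description: Filesystem has had less than 10% free space for 10 minutes.
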
doc/source/readme.rst (new file, 1 line)
@@ -0,0 +1 @@

.. include:: ../../README.rst