Foundation for LMA docs

This begins building documentation for the LMA services included
in openstack-helm-infra. This includes documentation for: kibana,
elasticsearch, fluent-logging, grafana, prometheus, and nagios

Change-Id: Iaa24be04748e76fabca998972398802e7e921ef1
Signed-off-by: Steve Wilkerson <wilkers.steve@gmail.com>
Steve Wilkerson 2018-05-15 15:14:14 -05:00
parent 1c87af7856
commit eab9ca05a6
10 changed files with 1369 additions and 1 deletions


@ -8,7 +8,9 @@ Contents:
install/index
testing/index
monitoring/index
logging/index
readme
Indices and Tables
==================


@ -0,0 +1,196 @@
Elasticsearch
=============
The Elasticsearch chart in openstack-helm-infra provides a distributed data
store to index and analyze logs generated from the OpenStack-Helm services.
The chart contains templates for:
- Elasticsearch client nodes
- Elasticsearch data nodes
- Elasticsearch master nodes
- An Elasticsearch exporter for providing cluster metrics to Prometheus
- A cronjob for Elastic Curator to manage data indices
Authentication
--------------
The Elasticsearch deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Elasticsearch. The
username and password are configured under the Elasticsearch entry in the
endpoints section of the chart's values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
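As a minimal sketch, the credentials might be overridden as follows; the exact
structure of the auth entry under endpoints may differ between chart versions,
so treat the keys below as assumptions:
::
  endpoints:
    elasticsearch:
      auth:
        admin:
          username: admin
          password: changeme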
Elasticsearch Service Configuration
-----------------------------------
The Elasticsearch service configuration file can be modified with a combination
of pod environment variables and entries in the values.yaml file. Elasticsearch
does not require much configuration out of the box, and the default values for
these configuration settings are meant to provide a highly available cluster by
default.
The vital entries in this configuration file are:
- path.data: The path at which to store the indexed data
- path.repo: The location of any snapshot repositories to backup indexes
- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk
- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster
The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk
during execution, and setting this value to false will negatively affect the
health of your Elasticsearch nodes. The discovery.zen.minimum_master_nodes flag
registers the minimum number of masters required for your Elasticsearch cluster
to register as healthy and functional.
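As a hedged sketch, assuming these settings live under a conf.elasticsearch.config
tree in the chart's values.yaml, an override might look like:
::
  conf:
    elasticsearch:
      config:
        bootstrap:
          memory_lock: true
        discovery:
          zen:
            minimum_master_nodes: 2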
To read more about Elasticsearch's configuration file, please see the official
documentation_.
.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html
Elastic Curator
---------------
The Elasticsearch chart contains a cronjob to run Elastic Curator at specified
intervals to manage the lifecycle of your indices. Curator can:
- Take and send a snapshot of your indexes to a specified snapshot repository
- Delete indexes older than a specified length of time
- Restore indexes from previous index snapshots
- Reindex an index into a new or preexisting index
The full list of supported Curator actions can be found in the actions_ section of
the official Curator documentation. The list of options available for those
actions can be found in the options_ section of the Curator documentation.
.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html
.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html
Curator's configuration is handled via entries in Elasticsearch's values.yaml
file and must be overridden to achieve your index lifecycle management
needs. Please note that any unused field should be left blank; an entry of
"None" will result in an exception, as Curator will read it as a Python NoneType
instead of a value of None.
The section for Curator's service configuration can be found at:
::
conf:
curator:
config:
client:
hosts:
- elasticsearch-logging
port: 9200
url_prefix:
use_ssl: False
certificate:
client_cert:
client_key:
ssl_no_validate: False
http_auth:
timeout: 30
master_only: False
logging:
loglevel: INFO
logfile:
logformat: default
blacklist: ['elasticsearch', 'urllib3']
Curator's actions are configured in the following section:
::
conf:
curator:
action_file:
actions:
1:
action: delete_indices
description: "Clean up ES by deleting old indices"
options:
timeout_override:
continue_if_exception: False
ignore_empty_list: True
disable_action: True
filters:
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 30
field:
stats_result:
epoch:
exclude: False
The Elasticsearch chart contains commented example actions for deleting and
snapshotting indexes older than 30 days. Please note these actions are provided
as a reference and are disabled by default to avoid any unexpected behavior
against your indexes.
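Since Helm merges dictionary values from overrides, enabling one of the example
actions can be as small as flipping its disable_action flag. A minimal sketch,
assuming the delete action shown above remains the first entry:
::
  conf:
    curator:
      action_file:
        actions:
          1:
            options:
              disable_action: False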
Elasticsearch Exporter
----------------------
The Elasticsearch chart contains templates for an exporter to provide metrics
for Prometheus. These metrics provide insight into the performance and overall
health of your Elasticsearch cluster. Please note monitoring for Elasticsearch
is disabled by default, and must be enabled with the following override:
::
monitoring:
prometheus:
enabled: true
The Elasticsearch exporter uses the same service annotations as the other
exporters, and no additional configuration is required for Prometheus to target
the Elasticsearch exporter for scraping. The Elasticsearch exporter is
configured with command line flags, and the flags' default values can be found
under the following key in the values.yaml file:
::
conf:
prometheus_elasticsearch_exporter:
es:
all: true
timeout: 20s
These configuration keys control the following behaviors:
- es.all: Gather information from all nodes, not just the connecting node
- es.timeout: Timeout for metrics queries
More information about the Elasticsearch exporter can be found on the exporter's
GitHub_ page.
.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter
Snapshot Repositories
---------------------
Before Curator can store snapshots in a specified repository, Elasticsearch must
register the configured repository. To achieve this, the Elasticsearch chart
contains a job for registering an s3 snapshot repository backed by radosgateway.
This job is disabled by default as the curator actions for snapshots are
disabled by default. To enable the snapshot job, the
conf.elasticsearch.snapshots.enabled flag must be set to true. The following
configuration keys are relevant:
- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories
- conf.elasticsearch.snapshots.bucket: Name of the RGW s3 bucket to use
- conf.elasticsearch.snapshots.repositories: Name of repositories to create
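A hedged example override enabling the snapshot job; the bucket and repository
names are placeholders, and the exact shape of the repositories entry may differ
from the chart's defaults:
::
  conf:
    elasticsearch:
      snapshots:
        enabled: true
        bucket: elasticsearch-snapshots
        repositories:
          - logstash_snapshots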
More information about Elasticsearch repositories can be found in the official
Elasticsearch snapshot_ documentation:
.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories


@ -0,0 +1,279 @@
Fluent-logging
===============
The fluent-logging chart in openstack-helm-infra provides the base for a
centralized logging platform for OpenStack-Helm. The chart combines two
services, Fluentbit and Fluentd, to gather logs generated by the services,
filter on or add metadata to logged events, then forward them to Elasticsearch
for indexing.
Fluentbit
---------
Fluentbit runs as a log-collecting component on each host in the cluster, and
can be configured to target specific log locations on the host. The Fluentbit_
configuration schema can be found on the official Fluentbit website.
.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html
Fluentbit provides a set of plug-ins for ingesting and filtering various log
types. These plug-ins include:
- Tail: Tails a defined file for logged events
- Kube: Adds Kubernetes metadata to a logged event
- Systemd: Provides ability to collect logs from the journald daemon
- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP)
The complete list of plugins can be found in the configuration_ section of the
Fluentbit documentation.
.. _configuration: http://fluentbit.io/documentation/current/configuration/
Fluentbit uses parsers to turn unstructured log entries into structured entries
to make processing and filtering events easier. The two formats supported are
JSON maps and regular expressions. More information about Fluentbit's parsing
abilities can be found in the parsers_ section of Fluentbit's documentation.
.. _parsers: http://fluentbit.io/documentation/current/parser/
Fluentbit's service and parser configurations are defined via the values.yaml
file, which allows for custom definitions of inputs, filters and outputs for
your logging needs.
Fluentbit's configuration can be found under the following key:
::
conf:
fluentbit:
- service:
header: service
Flush: 1
Daemon: Off
Log_Level: info
Parsers_File: parsers.conf
- containers_tail:
header: input
Name: tail
Tag: kube.*
Path: /var/log/containers/*.log
Parser: docker
DB: /var/log/flb_kube.db
Mem_Buf_Limit: 5MB
- kube_filter:
header: filter
Name: kubernetes
Match: kube.*
Merge_JSON_Log: On
- fluentd_output:
header: output
Name: forward
Match: "*"
Host: ${FLUENTD_HOST}
Port: ${FLUENTD_PORT}
Fluentbit is configured by default to capture logs at the info log level. To
change this, override the Log_Level key with the appropriate level; the
available levels are documented in Fluentbit's configuration_ schema.
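For example, raising the verbosity to debug could look like the sketch below.
Because conf.fluentbit is a list, Helm overrides replace the entire list, so in
practice the input, filter, and output entries shown earlier must be supplied
alongside the modified service section:
::
  conf:
    fluentbit:
      - service:
          header: service
          Flush: 1
          Daemon: Off
          Log_Level: debug
          Parsers_File: parsers.conf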
Fluentbit's parser configuration can be found under the following key:
::
conf:
parsers:
- docker:
header: parser
Name: docker
Format: json
Time_Key: time
Time_Format: "%Y-%m-%dT%H:%M:%S.%L"
Time_Keep: On
The values for the fluentbit and parsers keys are consumed by a fluent-logging
helper template that produces the appropriate configurations for the relevant
sections. Each list item (keys prefixed with a '-') represents a section in the
configuration files, and the arbitrary name of the list item should represent a
logical description of the section defined. The header key represents the type
of definition (filter, input, output, service or parser), and the remaining
entries will be rendered as space delimited configuration keys and values. For
example, the definitions above would result in the following:
::
[SERVICE]
Daemon false
Flush 1
Log_Level info
Parsers_File parsers.conf
[INPUT]
DB /var/log/flb_kube.db
Mem_Buf_Limit 5MB
Name tail
Parser docker
Path /var/log/containers/*.log
Tag kube.*
[FILTER]
Match kube.*
Merge_JSON_Log true
Name kubernetes
[OUTPUT]
Host ${FLUENTD_HOST}
Match *
Name forward
Port ${FLUENTD_PORT}
[PARSER]
Format json
Name docker
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep true
Time_Key time
Fluentd
-------
Fluentd runs as a forwarding service that receives event entries from Fluentbit
and routes them to the appropriate destination. By default, Fluentd will route
all entries received from Fluentbit to Elasticsearch for indexing. The
Fluentd_ configuration schema can be found at the official Fluentd website.
.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file
Fluentd's configuration is handled in the values.yaml file in fluent-logging.
Similar to Fluentbit, configuration overrides provide flexibility in defining
custom routes for tagged log events. The configuration can be found under the
following key:
::
conf:
fluentd:
- fluentbit_forward:
header: source
type: forward
port: "#{ENV['FLUENTD_PORT']}"
bind: 0.0.0.0
- elasticsearch:
header: match
type: elasticsearch
expression: "**"
include_tag_key: true
host: "#{ENV['ELASTICSEARCH_HOST']}"
port: "#{ENV['ELASTICSEARCH_PORT']}"
logstash_format: true
buffer_chunk_limit: 10M
buffer_queue_limit: 32
flush_interval: "20"
max_retry_wait: 300
disable_retry_limit: ""
The values for the fluentd keys are consumed by a fluent-logging helper template
that produces appropriate configurations for each directive desired. The list
items (keys prefixed with a '-') represent sections in the configuration file,
and the name of each list item should represent a logical description of the
section defined. The header key represents the type of definition (name of the
fluentd plug-in used), and the expression key is used when the plug-in requires
a pattern to match against (example: matches on certain input patterns). The
remaining entries will be rendered as space delimited configuration keys and
values. For example, the definition above would result in the following:
::
<source>
bind 0.0.0.0
port "#{ENV['FLUENTD_PORT']}"
@type forward
</source>
<match **>
buffer_chunk_limit 10M
buffer_queue_limit 32
disable_retry_limit
flush_interval 20s
host "#{ENV['ELASTICSEARCH_HOST']}"
include_tag_key true
logstash_format true
max_retry_wait 300
port "#{ENV['ELASTICSEARCH_PORT']}"
@type elasticsearch
</match>
Some fluentd plug-ins require nested definitions. The fluentd helper template
can handle these definitions with the following structure:
::
conf:
td_agent:
- fluentbit_forward:
header: source
type: forward
port: "#{ENV['FLUENTD_PORT']}"
bind: 0.0.0.0
- log_transformer:
header: filter
type: record_transformer
expression: "foo.bar"
inner_def:
- record_transformer:
header: record
hostname: my_host
tag: my_tag
In this example, the inner_def list will generate a nested configuration
entry in the log_transformer section. Nested definitions are handled by
supplying a list as the value for an arbitrary key, and the list value
indicates the entry should be handled as a nested definition. The helper
template will render the above example key/value pairs as the following:
::
<source>
bind 0.0.0.0
port "#{ENV['FLUENTD_PORT']}"
@type forward
</source>
<filter foo.bar>
<record>
hostname my_host
tag my_tag
</record>
@type record_transformer
</filter>
Fluentd Exporter
----------------------
The fluent-logging chart contains templates for an exporter to provide metrics
for Fluentd. These metrics provide insight into Fluentd's performance. Please
note monitoring for Fluentd is disabled by default, and must be enabled with the
following override:
::
monitoring:
prometheus:
enabled: true
The Fluentd exporter uses the same service annotations as the other exporters,
and no additional configuration is required for Prometheus to target the
Fluentd exporter for scraping. The Fluentd exporter is configured with command
line flags, and the flags' default values can be found under the following key
in the values.yaml file:
::
conf:
fluentd_exporter:
log:
format: "logger:stdout?json=true"
level: "info"
These configuration keys control the following behaviors:
- log.format: Define the logger used and format of the output
- log.level: Log level for the exporter to use
More information about the Fluentd exporter can be found on the exporter's
GitHub_ page.
.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter


@ -0,0 +1,11 @@
OpenStack-Helm Logging
======================
Contents:
.. toctree::
:maxdepth: 2
elasticsearch
fluent-logging
kibana


@ -0,0 +1,76 @@
Kibana
======
The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed
into Elasticsearch. These visualizations provide the means to view logs captured
from services deployed in the cluster and targeted for collection by Fluentbit.
Authentication
--------------
The Kibana deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Kibana. The username and password
are configured under the Kibana entry in the endpoints section of the chart's
values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Configuration
-------------
Kibana's configuration is driven by the chart's values.yaml file. The configuration
options are found under the following keys:
::
conf:
elasticsearch:
pingTimeout: 1500
preserveHost: true
requestTimeout: 30000
shardTimeout: 0
startupTimeout: 5000
il8n:
defaultLocale: en
kibana:
defaultAppId: discover
index: .kibana
logging:
quiet: false
silent: false
verbose: false
ops:
interval: 5000
server:
host: localhost
maxPayloadBytes: 1048576
port: 5601
ssl:
enabled: false
The case of the sub-keys is important as these values are injected into
Kibana's configuration configmap with the toYaml function. More information on
the configuration options and available settings can be found in the official
Kibana documentation_.
.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html
Installation
------------
.. code-block:: bash
helm install --namespace=<namespace> local/kibana --name=kibana
Setting Time Field
------------------
For Kibana to successfully read the logs from Elasticsearch's indexes, the time
field must be set manually after Kibana has successfully deployed. Upon visiting
the Kibana dashboard for the first time, a prompt will appear to choose the time
field from a drop-down menu. The default time field for Elasticsearch indexes is
'@timestamp'. Once this field is selected, the default view for querying log
entries can be found by selecting the "Discover" tab.


@ -0,0 +1,89 @@
Grafana
=======
The Grafana chart in OpenStack-Helm Infra provides default dashboards for the
metrics gathered with Prometheus. The default dashboards include visualizations
for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and
OpenStack.
Configuration
-------------
Grafana
~~~~~~~
Grafana's configuration is driven with the chart's values.yaml file, and the
relevant configuration entries are under the following key:
::
conf:
grafana:
paths:
server:
database:
session:
security:
users:
log:
log.console:
dashboards.json:
grafana_net:
These keys correspond to sections in the grafana.ini configuration file, and the
to_ini helm-toolkit function will render these values into the appropriate
format in grafana.ini. The list of options for these keys can be found in the
official Grafana configuration_ documentation.
.. _configuration: http://docs.grafana.org/installation/configuration/
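As an illustrative sketch, an operator could override the security section; the
option names below come from Grafana's grafana.ini reference rather than the
chart's defaults:
::
  conf:
    grafana:
      security:
        admin_user: admin
        admin_password: changeme
        cookie_secure: true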
Prometheus Data Source
~~~~~~~~~~~~~~~~~~~~~~
Grafana requires configured data sources for gathering metrics for display in
its dashboards. The configuration options for data sources are found under the
following key in Grafana's values.yaml file:
::
conf:
provisioning:
      datasources:
monitoring:
name: prometheus
type: prometheus
access: proxy
orgId: 1
editable: true
basicAuth: true
The Grafana chart will use the keys under each entry beneath
.conf.provisioning.datasources as inputs to a helper template that will render
the appropriate configuration for the data source. The key for each data source
(monitoring in the above example) should map to an entry in the endpoints
section of the chart's values.yaml, as the data source's URL and authentication
credentials will be populated by the values defined in the corresponding
endpoint. More information about data sources can be found in the official
Grafana sources_ documentation.
.. _sources: http://docs.grafana.org/features/datasources/
Dashboards
~~~~~~~~~~
Grafana adds dashboards during installation, with the dashboards defined in YAML
under the following key:
::
conf:
dashboards:
These YAML definitions are transformed to JSON and added to Grafana's
configuration configmap, then mounted to the Grafana pods dynamically, allowing
for flexibility in defining and adding custom dashboards to Grafana. Dashboards
can be added by inserting a new key along with a YAML dashboard definition as
the value. Additional dashboards can be found by searching Grafana's dashboards_
page, or you can define your own. A JSON-to-YAML tool, such as json2yaml_, will
help transform any custom or new dashboards from JSON to YAML.
.. _json2yaml: https://www.json2yaml.com/
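As a minimal sketch, a custom dashboard could be added under a new key, with the
dashboard's JSON expressed as YAML; the fields below are illustrative and not a
complete Grafana dashboard definition:
::
  conf:
    dashboards:
      my_custom_dashboard:
        title: My Custom Dashboard
        timezone: browser
        rows:
          - title: Example Row
            panels:
              - title: Node Load Average
                type: graph
                targets:
                  - expr: node_load1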


@ -0,0 +1,11 @@
OpenStack-Helm Monitoring
=========================
Contents:
.. toctree::
:maxdepth: 2
grafana
prometheus
nagios


@ -0,0 +1,365 @@
Nagios
======
The Nagios chart in openstack-helm-infra can be used to provide an alerting
service that is tightly coupled to an OpenStack-Helm deployment. The Nagios
chart uses a custom Nagios core image that includes plugins developed to query
Prometheus directly for scraped metrics and triggered alarms, query the Ceph
manager endpoints directly to determine the health of a Ceph cluster, and
query Elasticsearch for logged events that meet certain criteria (experimental).
Authentication
--------------
The Nagios deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Nagios. The username and password
are configured under the nagios entry in the endpoints section of the chart's
values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Image Plugins
-------------
The Nagios image used contains custom plugins that can be used for the defined
service check commands. These plugins include:
- check_prometheus_metric.py: Query Prometheus for a specific metric and value
- check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter
- check_rest_get_api.py: Check REST API status
- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
- query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric
More information about the Nagios image and plugins can be found here_.
.. _here: https://github.com/att-comdev/nagios
Nagios Service Configuration
----------------------------
The Nagios service is configured via the following section in the chart's
values file:
::
conf:
nagios:
nagios:
log_file: /opt/nagios/var/log/nagios.log
cfg_file:
- /opt/nagios/etc/nagios_objects.cfg
- /opt/nagios/etc/objects/commands.cfg
- /opt/nagios/etc/objects/contacts.cfg
- /opt/nagios/etc/objects/timeperiods.cfg
- /opt/nagios/etc/objects/templates.cfg
- /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
object_cache_file: /opt/nagios/var/objects.cache
precached_object_file: /opt/nagios/var/objects.precache
resource_file: /opt/nagios/etc/resource.cfg
status_file: /opt/nagios/var/status.dat
status_update_interval: 10
nagios_user: nagios
nagios_group: nagios
check_external_commands: 1
command_file: /opt/nagios/var/rw/nagios.cmd
lock_file: /var/run/nagios.lock
temp_file: /opt/nagios/var/nagios.tmp
temp_path: /tmp
event_broker_options: -1
log_rotation_method: d
log_archive_path: /opt/nagios/var/log/archives
use_syslog: 1
log_service_retries: 1
log_host_retries: 1
log_event_handlers: 1
log_initial_states: 0
log_current_states: 1
log_external_commands: 1
log_passive_checks: 1
service_inter_check_delay_method: s
max_service_check_spread: 30
service_interleave_factor: s
host_inter_check_delay_method: s
max_host_check_spread: 30
max_concurrent_checks: 60
check_result_reaper_frequency: 10
max_check_result_reaper_time: 30
check_result_path: /opt/nagios/var/spool/checkresults
max_check_result_file_age: 3600
cached_host_check_horizon: 15
cached_service_check_horizon: 15
enable_predictive_host_dependency_checks: 1
enable_predictive_service_dependency_checks: 1
soft_state_dependencies: 0
auto_reschedule_checks: 0
auto_rescheduling_interval: 30
auto_rescheduling_window: 180
service_check_timeout: 60
host_check_timeout: 60
event_handler_timeout: 60
notification_timeout: 60
ocsp_timeout: 5
perfdata_timeout: 5
retain_state_information: 1
state_retention_file: /opt/nagios/var/retention.dat
retention_update_interval: 60
use_retained_program_state: 1
use_retained_scheduling_info: 1
retained_host_attribute_mask: 0
retained_service_attribute_mask: 0
retained_process_host_attribute_mask: 0
retained_process_service_attribute_mask: 0
retained_contact_host_attribute_mask: 0
retained_contact_service_attribute_mask: 0
interval_length: 1
check_workers: 4
check_for_updates: 1
bare_update_check: 0
use_aggressive_host_checking: 0
execute_service_checks: 1
accept_passive_service_checks: 1
execute_host_checks: 1
accept_passive_host_checks: 1
enable_notifications: 1
enable_event_handlers: 1
process_performance_data: 0
obsess_over_services: 0
obsess_over_hosts: 0
translate_passive_host_checks: 0
passive_host_checks_are_soft: 0
check_for_orphaned_services: 1
check_for_orphaned_hosts: 1
check_service_freshness: 1
service_freshness_check_interval: 60
check_host_freshness: 0
host_freshness_check_interval: 60
additional_freshness_latency: 15
enable_flap_detection: 1
low_service_flap_threshold: 5.0
high_service_flap_threshold: 20.0
low_host_flap_threshold: 5.0
high_host_flap_threshold: 20.0
date_format: us
use_regexp_matching: 1
use_true_regexp_matching: 0
daemon_dumps_core: 0
use_large_installation_tweaks: 0
enable_environment_macros: 0
debug_level: 0
debug_verbosity: 1
debug_file: /opt/nagios/var/nagios.debug
max_debug_file_size: 1000000
allow_empty_hostgroup_assignment: 1
illegal_macro_output_chars: "`~$&|'<>\""
Nagios CGI Configuration
------------------------
The Nagios CGI configuration is defined via the following section in the chart's
values file:
::
conf:
nagios:
cgi:
main_config_file: /opt/nagios/etc/nagios.cfg
physical_html_path: /opt/nagios/share
url_html_path: /nagios
show_context_help: 0
use_pending_states: 1
use_authentication: 0
use_ssl_authentication: 0
authorized_for_system_information: "*"
authorized_for_configuration_information: "*"
authorized_for_system_commands: nagiosadmin
authorized_for_all_services: "*"
authorized_for_all_hosts: "*"
authorized_for_all_service_commands: "*"
authorized_for_all_host_commands: "*"
default_statuswrl_layout: 4
ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
refresh_rate: 90
result_limit: 100
escape_html_tags: 1
action_url_target: _blank
notes_url_target: _blank
lock_author_names: 1
navbar_search_for_addresses: 1
navbar_search_for_aliases: 1
Nagios Host Configuration
-------------------------
The Nagios chart includes a single host definition for the Prometheus instance
queried for metrics. The host definition can be found under the following
values key:
::
conf:
nagios:
hosts:
- prometheus:
use: linux-server
host_name: prometheus
alias: "Prometheus Monitoring"
address: 127.0.0.1
hostgroups: prometheus-hosts
check_command: check-prometheus-host-alive
The address for the Prometheus host is defined by the PROMETHEUS_SERVICE
environment variable in the deployment template, which is determined by the
monitoring entry in the Nagios chart's endpoints section. The endpoint is then
available as a macro for Nagios to use in all Prometheus based queries. For
example:
::
- check_prometheus_host_alive:
command_name: check-prometheus-host-alive
command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
The $USER2$ macro above corresponds to the Prometheus endpoint defined in the
PROMETHEUS_SERVICE environment variable. All checks that use the
prometheus-hosts hostgroup will map back to the Prometheus host defined by this
endpoint.
Nagios HostGroup Configuration
------------------------------
The Nagios chart includes configuration values for defined host groups under the
following values key:
::
conf:
nagios:
host_groups:
- prometheus-hosts:
hostgroup_name: prometheus-hosts
alias: "Prometheus Virtual Host"
- base-os:
hostgroup_name: base-os
alias: "base-os"
These host groups are used to define which group of hosts should be targeted by
a particular Nagios check. An example of a check that targets Prometheus for a
specific metric query would be:
::
- check_ceph_monitor_quorum:
use: notifying_service
hostgroup_name: prometheus-hosts
service_description: "CEPH_quorum"
check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
check_interval: 60
An example of a check that targets all hosts for a base-os type check (memory
usage, latency, etc) would be:
::
- check_memory_usage:
use: notifying_service
service_description: Memory_usage
check_command: check_memory_usage
hostgroup_name: base-os
These two host groups allow for a wide range of targeted checks for determining
the status of all components of an OpenStack-Helm deployment.
Nagios Command Configuration
----------------------------
The Nagios chart includes configuration values for the command definitions Nagios
will use when executing service checks. These values are found under the
following key:
::
conf:
nagios:
commands:
- send_service_snmp_trap:
command_name: send_service_snmp_trap
command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
- send_host_snmp_trap:
command_name: send_host_snmp_trap
command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
- send_service_http_post:
command_name: send_service_http_post
command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
- send_host_http_post:
command_name: send_host_http_post
command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
- check_prometheus_host_alive:
command_name: check-prometheus-host-alive
command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
The list of defined commands can be modified with configuration overrides, which
allows operators to define commands specific to an infrastructure deployment.
These commands can include querying Prometheus for metrics on dependencies for a
service to determine whether an alert should be raised, executing checks on each
host to determine network latency or file system usage, or checking each node
for issues with NTP clock skew.
Note: Since the conf.nagios.commands key contains a list of the defined commands,
the entire contents of conf.nagios.commands will need to be overridden if
additional commands are desired, as Helm overrides replace lists rather than
merging them.
Nagios Service Check Configuration
----------------------------------
The Nagios chart includes configuration values for the service checks Nagios
will execute. These service check commands can be found under the following
key:
::
conf:
nagios:
services:
- notifying_service:
name: notifying_service
use: generic-service
flap_detection_enabled: 0
process_perf_data: 0
contact_groups: snmp_and_http_notifying_contact_group
check_interval: 60
notification_interval: 120
retry_interval: 30
register: 0
- check_ceph_health:
use: notifying_service
hostgroup_name: base-os
service_description: "CEPH_health"
check_command: check_ceph_health
check_interval: 300
- check_hosts_health:
use: generic-service
hostgroup_name: prometheus-hosts
service_description: "Nodes_health"
check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready.
check_interval: 60
- check_prometheus_replicas:
use: notifying_service
hostgroup_name: prometheus-hosts
service_description: "Prometheus_replica-count"
check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas
check_interval: 60
The Nagios service configurations define the checks Nagios will perform. These
checks contain keys defining the service type to use, the host group to target,
the description of the service check, the command the check should use, and the
interval at which to trigger the service check. These services can also be
extended to provide additional insight into the overall status of a particular
service, or to define advanced checks for determining the overall health and
liveness of a service. For example, a service check could trigger an alarm for
the OpenStack services when Nagios detects that the relevant database and
message queue have become unresponsive.


@ -0,0 +1,338 @@
Prometheus
==========
The Prometheus chart in openstack-helm-infra provides a time series database and
a powerful query language for monitoring various components of OpenStack-Helm.
Prometheus gathers metrics by scraping defined service endpoints or pods at
specified intervals and indexing them in the underlying time series database.
Authentication
--------------
The Prometheus deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Prometheus. The
username and password are configured under the monitoring entry in the endpoints
section of the chart's values.yaml.
The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.
Prometheus Service Configuration
--------------------------------
The Prometheus service is configured via command line flags set during runtime.
These flags include: setting the configuration file, setting log levels, setting
characteristics of the time series database, and enabling the web admin API for
snapshot support. These settings can be configured via the values tree at:
::
conf:
prometheus:
command_line_flags:
log.level: info
query.max_concurrency: 20
query.timeout: 2m
storage.tsdb.path: /var/lib/prometheus/data
storage.tsdb.retention: 7d
web.enable_admin_api: false
web.enable_lifecycle: false
The Prometheus configuration file contains the definitions for scrape targets
and the location of the rules files for triggering alerts on scraped metrics.
The configuration file is defined in the values file, and can be found at:
::
conf:
prometheus:
scrape_configs: |
By defining the configuration via the values file, an operator can override all
configuration components of the Prometheus deployment at runtime.
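A hedged sketch of such an override, assuming the scrape_configs value holds the
raw Prometheus configuration as a multi-line string:
::
  conf:
    prometheus:
      scrape_configs: |
        global:
          scrape_interval: 60s
          evaluation_interval: 60s
        scrape_configs:
          - job_name: kubernetes-service-endpoints
            kubernetes_sd_configs:
              - role: endpoints
            relabel_configs:
              - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
                action: keep
                regex: true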
Kubernetes Endpoint Configuration
---------------------------------
The Prometheus chart in openstack-helm-infra uses the built-in service discovery
mechanisms for Kubernetes endpoints and pods to automatically configure scrape
targets. Functions added to helm-toolkit allow configuration of these targets
via annotations that can be applied to any service or pod that exposes metrics
for Prometheus, whether a service for an application-specific exporter or an
application that provides a metrics endpoint via its service. The values in
these functions correspond to entries in the monitoring tree under the
prometheus key in a chart's values.yaml file.
The function definitions are shown below:
::
{{- define "helm-toolkit.snippets.prometheus_service_annotations" -}}
{{- $config := index . 0 -}}
{{- if $config.scrape }}
prometheus.io/scrape: {{ $config.scrape | quote }}
{{- end }}
{{- if $config.scheme }}
prometheus.io/scheme: {{ $config.scheme | quote }}
{{- end }}
{{- if $config.path }}
prometheus.io/path: {{ $config.path | quote }}
{{- end }}
{{- if $config.port }}
prometheus.io/port: {{ $config.port | quote }}
{{- end }}
{{- end -}}
::
{{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}}
{{- $config := index . 0 -}}
{{- if $config.scrape }}
prometheus.io/scrape: {{ $config.scrape | quote }}
{{- end }}
{{- if $config.path }}
prometheus.io/path: {{ $config.path | quote }}
{{- end }}
{{- if $config.port }}
prometheus.io/port: {{ $config.port | quote }}
{{- end }}
{{- end -}}
These functions render the following annotations:
- prometheus.io/scrape: Must be set to true for Prometheus to scrape the target
- prometheus.io/scheme: Overrides the scheme used to scrape the target if not http
- prometheus.io/path: Overrides the path used to scrape the target's metrics if not /metrics
- prometheus.io/port: Overrides the port to scrape metrics on if not the service's default port
Each chart that can be targeted for monitoring by Prometheus has a prometheus
section under a monitoring tree in the chart's values.yaml, and Prometheus
monitoring is disabled by default for those services. Example values for the
required entries can be found in the following monitoring configuration for the
prometheus-node-exporter chart:
::
monitoring:
prometheus:
enabled: false
node_exporter:
scrape: true
If the monitoring.prometheus.enabled key is set to true, the condition guarding
the annotations evaluates to true and the annotations are applied to the
targeted service or pod. For example:
::
{{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }}
---
apiVersion: v1
kind: Service
metadata:
name: {{ tuple "node_metrics" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
labels:
{{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
annotations:
{{- if .Values.monitoring.prometheus.enabled }}
{{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }}
{{- end }}
Kubelet, API Server, and cAdvisor
---------------------------------
The Prometheus chart includes scrape target configurations for the kubelet, the
Kubernetes API servers, and cAdvisor. These targets are configured based on
a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy
Kubernetes in the gates. These configurations may need to change based on your
chosen method of deployment. Please note the cAdvisor metrics will not be
captured if the kubelet was started with the following flag:
::
--cadvisor-port=0
To enable the gathering of the kubelet's custom metrics, the following flag must
be set:
::
--enable-custom-metrics
Installation
------------
The Prometheus chart can be installed with the following command:
.. code-block:: bash
helm install --namespace=openstack local/prometheus --name=prometheus
The above command results in a Prometheus deployment configured to automatically
discover services with the necessary annotations for scraping, and to gather
metrics from the kubelet, the Kubernetes API servers, and cAdvisor.
Extending Prometheus
--------------------
Prometheus can target various exporters to gather metrics related to specific
applications to extend visibility into an OpenStack-Helm deployment. Currently,
openstack-helm-infra contains charts for:
- prometheus-kube-state-metrics: Provides additional Kubernetes metrics
- prometheus-node-exporter: Provides metrics for nodes and the Linux kernel
- prometheus-openstack-metrics-exporter: Provides metrics for OpenStack services
Kube-State-Metrics
~~~~~~~~~~~~~~~~~~
The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects
as well as metrics for kube-scheduler and kube-controller-manager. Information
on the specific metrics available via the kube-state-metrics service can be
found in the kube-state-metrics_ documentation.
The prometheus-kube-state-metrics chart can be installed with the following:
.. code-block:: bash
helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics
.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation
Node Exporter
~~~~~~~~~~~~~
The prometheus-node-exporter chart provides hardware and operating system metrics
exposed via Linux kernels. Information on the specific metrics available via
the Node exporter can be found on the Node_exporter_ GitHub page.
The prometheus-node-exporter chart can be installed with the following:
.. code-block:: bash
helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter
.. _Node_exporter: https://github.com/prometheus/node_exporter
OpenStack Exporter
~~~~~~~~~~~~~~~~~~
The prometheus-openstack-exporter chart provides metrics specific to the
OpenStack services. The exporter's source code can be found here_. While the
metrics provided are by no means comprehensive, they will be expanded upon.
Please note the OpenStack exporter requires the creation of a Keystone user to
successfully gather metrics. To create the required user, the chart uses the
same keystone user management job the OpenStack service charts use.
The prometheus-openstack-exporter chart can be installed with the following:
.. code-block:: bash
helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter
.. _here: https://github.com/att-comdev/openstack-metrics-collector
Other exporters
~~~~~~~~~~~~~~~
Certain charts in OpenStack-Helm include templates for application-specific
Prometheus exporters, which keeps the monitoring of those services tightly coupled
to the chart. The templates for these exporters can be found in the monitoring
subdirectory in the chart. These exporters are disabled by default, and can be
enabled by setting the appropriate flag in the monitoring.prometheus key of the
chart's values.yaml file. The charts containing exporters include:
- Elasticsearch_
- RabbitMQ_
- MariaDB_
- Memcached_
- Fluentd_
- Postgres_
.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter
.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter
.. _MariaDB: https://github.com/prometheus/mysqld_exporter
.. _Memcached: https://github.com/prometheus/memcached_exporter
.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter
.. _Postgres: https://github.com/wrouesnel/postgres_exporter
Ceph
~~~~
Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus module.
This module can be enabled in Ceph's values.yaml under the ceph_mgr_enabled_plugins
key by appending prometheus to the list of enabled modules. After enabling the
prometheus module, metrics can be scraped on the ceph-mgr service endpoint. This
relies on the Prometheus annotations attached to the ceph-mgr service template, and
these annotations can be modified in the endpoints section of Ceph's values.yaml
file. Information on the specific metrics available via the prometheus module
can be found in the Ceph prometheus_ module documentation.
.. _prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/
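A sketch of the corresponding override; depending on the Ceph chart version, the
ceph_mgr_enabled_plugins key may sit at the top level of values.yaml or under a
conf section:
::
  ceph_mgr_enabled_plugins:
    - status
    - prometheus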
Prometheus Dashboard
--------------------
Prometheus includes a dashboard that can be accessed via the exposed
Prometheus endpoint (NodePort or otherwise). This dashboard will give you a
view of your scrape targets' state, the configuration values for Prometheus's
scrape jobs and command line flags, a view of any alerts triggered based on the
defined rules, and a means for using PromQL to query scraped metrics. The
Prometheus dashboard is a useful tool for verifying Prometheus is configured
appropriately and to verify the status of any services targeted for scraping via
the Prometheus service discovery annotations.
Rules Configuration
-------------------
Prometheus provides a querying language that can operate on defined rules, which
allow for the generation of alerts on specific metrics. The Prometheus chart in
openstack-helm-infra defines these rules via the values.yaml file. Defining them
in the values file gives operators the flexibility to provide specific rules via
overrides at installation. The following rules keys are provided:
::
values:
conf:
rules:
alertmanager:
etcd3:
kube_apiserver:
kube_controller_manager:
kubelet:
kubernetes:
rabbitmq:
mysql:
ceph:
openstack:
custom:
These keys provide recording and alerting rules for all infrastructure
components of an OpenStack-Helm deployment. If you wish to exclude rules for a
component, leave its tree empty in an overrides file. To read more
about Prometheus recording and alert rules definitions, please see the official
Prometheus recording_ and alert_ rules documentation.
.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Note: Prometheus releases prior to 2.0 used gotpl to define rules. Prometheus
2.0 changed the rules format to YAML, making them much easier to read. The
Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take
advantage of changes to the underlying storage layer and the handling of stale
data. The chart will not support overrides for Prometheus versions below 2.0,
as the command line flags for the service changed between versions.
The wide range of exporters included in OpenStack-Helm, coupled with the ability
to define rules with configuration overrides, allows for the addition of custom
alerting and recording rules to fit an operator's monitoring needs. Adding new
rules or modifying existing rules requires overrides for either an existing key
under conf.rules or the addition of a new key under conf.rules. Custom rules can
be used to define complex checks that can be extended for determining the
liveness or health of infrastructure components.
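For instance, a custom alerting rule could be supplied under conf.rules.custom
using the Prometheus 2.0 rules format; the group and alert names below are
illustrative:
::
  conf:
    rules:
      custom:
        groups:
          - name: custom.rules
            rules:
              - alert: TargetDown
                expr: up == 0
                for: 10m
                labels:
                  severity: warning
                annotations:
                  description: Scrape target has been unreachable for more than 10 minutes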

doc/source/readme.rst

@ -0,0 +1 @@
.. include:: ../../README.rst