From eab9ca05a6a9172602bd9e0a8b02c97f0ee6b466 Mon Sep 17 00:00:00 2001 From: Steve Wilkerson Date: Tue, 15 May 2018 15:14:14 -0500 Subject: [PATCH] Foundation for LMA docs This begins building documentation for the LMA services included in openstack-helm-infra. This includes documentation for: kibana, elasticsearch, fluent-logging, grafana, prometheus, and nagios Change-Id: Iaa24be04748e76fabca998972398802e7e921ef1 Signed-off-by: Steve Wilkerson --- doc/source/index.rst | 4 +- doc/source/logging/elasticsearch.rst | 196 ++++++++++++++ doc/source/logging/fluent-logging.rst | 279 ++++++++++++++++++++ doc/source/logging/index.rst | 11 + doc/source/logging/kibana.rst | 76 ++++++ doc/source/monitoring/grafana.rst | 89 +++++++ doc/source/monitoring/index.rst | 11 + doc/source/monitoring/nagios.rst | 365 ++++++++++++++++++++++++++ doc/source/monitoring/prometheus.rst | 338 ++++++++++++++++++++++++ doc/source/readme.rst | 1 + 10 files changed, 1369 insertions(+), 1 deletion(-) create mode 100644 doc/source/logging/elasticsearch.rst create mode 100644 doc/source/logging/fluent-logging.rst create mode 100644 doc/source/logging/index.rst create mode 100644 doc/source/logging/kibana.rst create mode 100644 doc/source/monitoring/grafana.rst create mode 100644 doc/source/monitoring/index.rst create mode 100644 doc/source/monitoring/nagios.rst create mode 100644 doc/source/monitoring/prometheus.rst create mode 100644 doc/source/readme.rst diff --git a/doc/source/index.rst b/doc/source/index.rst index 936eb8913..489813aad 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -8,7 +8,9 @@ Contents: install/index testing/index - + monitoring/index + logging/index + readme Indices and Tables ================== diff --git a/doc/source/logging/elasticsearch.rst b/doc/source/logging/elasticsearch.rst new file mode 100644 index 000000000..af0e7a515 --- /dev/null +++ b/doc/source/logging/elasticsearch.rst @@ -0,0 +1,196 @@ +Elasticsearch +============= + +The Elasticsearch chart 
in openstack-helm-infra provides a distributed data +store to index and analyze logs generated from the OpenStack-Helm services. +The chart contains templates for: + +- Elasticsearch client nodes +- Elasticsearch data nodes +- Elasticsearch master nodes +- An Elasticsearch exporter for providing cluster metrics to Prometheus +- A cronjob for Elastic Curator to manage data indices + +Authentication +-------------- + +The Elasticsearch deployment includes a sidecar container that runs an Apache +reverse proxy to add authentication capabilities for Elasticsearch. The +username and password are configured under the Elasticsearch entry in the +endpoints section of the chart's values.yaml. + +The configuration for Apache can be found under the conf.httpd key, and uses a +helm-toolkit function that allows for including gotpl entries in the template +directly. This allows the use of other templates, like the endpoint lookup +function templates, directly in the configuration for Apache. + +Elasticsearch Service Configuration +----------------------------------- + +The Elasticsearch service configuration file can be modified with a combination +of pod environment variables and entries in the values.yaml file. Elasticsearch +does not require much configuration out of the box, and the default values for +these configuration settings are meant to provide a highly available cluster by +default. + +The vital entries in this configuration file are: + +- path.data: The path at which to store the indexed data +- path.repo: The location of any snapshot repositories used to back up indexes +- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk +- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster + +The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk +during execution, and setting this value to false will negatively affect the +health of your Elasticsearch nodes. 
The discovery.zen.minimum_master_nodes flag +registers the minimum number of masters required for your Elasticsearch cluster +to register as healthy and functional. + +To read more about Elasticsearch's configuration file, please see the official +documentation_. + +.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html + +Elastic Curator +--------------- + +The Elasticsearch chart contains a cronjob to run Elastic Curator at specified +intervals to manage the lifecycle of your indices. Curator can: + +- Take and send a snapshot of your indexes to a specified snapshot repository +- Delete indexes older than a specified length of time +- Restore indexes with previous index snapshots +- Reindex an index into a new or preexisting index + +The full list of supported Curator actions can be found in the actions_ section of +the official Curator documentation. The list of options available for those +actions can be found in the options_ section of the Curator documentation. + +.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html +.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html + +Curator's configuration is handled via entries in Elasticsearch's values.yaml +file and must be overridden to achieve your index lifecycle management +needs. Please note that any unused field should be left blank, as an entry of +"None" will result in an exception, as Curator will read it as a Python NoneType +instead of a value of None. 
+ +The section for Curator's service configuration can be found at: + +:: + + conf: + curator: + config: + client: + hosts: + - elasticsearch-logging + port: 9200 + url_prefix: + use_ssl: False + certificate: + client_cert: + client_key: + ssl_no_validate: False + http_auth: + timeout: 30 + master_only: False + logging: + loglevel: INFO + logfile: + logformat: default + blacklist: ['elasticsearch', 'urllib3'] + +Curator's actions are configured in the following section: + +:: + + conf: + curator: + action_file: + actions: + 1: + action: delete_indices + description: "Clean up ES by deleting old indices" + options: + timeout_override: + continue_if_exception: False + ignore_empty_list: True + disable_action: True + filters: + - filtertype: age + source: name + direction: older + timestring: '%Y.%m.%d' + unit: days + unit_count: 30 + field: + stats_result: + epoch: + exclude: False + +The Elasticsearch chart contains commented example actions for deleting and +snapshotting indexes older than 30 days. Please note these actions are provided as a +reference and are disabled by default to avoid any unexpected behavior against +your indexes. + +Elasticsearch Exporter +---------------------- + +The Elasticsearch chart contains templates for an exporter to provide metrics +for Prometheus. These metrics provide insight into the performance and overall +health of your Elasticsearch cluster. Please note monitoring for Elasticsearch +is disabled by default, and must be enabled with the following override: + + +:: + + monitoring: + prometheus: + enabled: true + + +The Elasticsearch exporter uses the same service annotations as the other +exporters, and no additional configuration is required for Prometheus to target +the Elasticsearch exporter for scraping. 
The Elasticsearch exporter is +configured with command line flags, and the flags' default values can be found +under the following key in the values.yaml file: + +:: + + conf: + prometheus_elasticsearch_exporter: + es: + all: true + timeout: 20s + +The configuration keys configure the following behaviors: + +- es.all: Gather information from all nodes, not just the connecting node +- es.timeout: Timeout for metrics queries + +More information about the Elasticsearch exporter can be found on the exporter's +GitHub_ page. + +.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter + + +Snapshot Repositories +--------------------- + +Before Curator can store snapshots in a specified repository, Elasticsearch must +register the configured repository. To achieve this, the Elasticsearch chart +contains a job for registering an s3 snapshot repository backed by radosgateway. +This job is disabled by default as the curator actions for snapshots are +disabled by default. To enable the snapshot job, the +conf.elasticsearch.snapshots.enabled flag must be set to true. The following +configuration keys are relevant: + +- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories +- conf.elasticsearch.snapshots.bucket: Name of the RGW s3 bucket to use +- conf.elasticsearch.snapshots.repositories: Name of repositories to create + +More information about Elasticsearch repositories can be found in the official +Elasticsearch snapshot_ documentation: + +.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories diff --git a/doc/source/logging/fluent-logging.rst b/doc/source/logging/fluent-logging.rst new file mode 100644 index 000000000..b3ea41899 --- /dev/null +++ b/doc/source/logging/fluent-logging.rst @@ -0,0 +1,279 @@ +Fluent-logging +=============== + +The fluent-logging chart in openstack-helm-infra provides the base for a +centralized logging platform for OpenStack-Helm. 
The chart combines two +services, Fluentbit and Fluentd, to gather logs generated by the services, +filter on or add metadata to logged events, then forward them to Elasticsearch +for indexing. + +Fluentbit +--------- + +Fluentbit runs as a log-collecting component on each host in the cluster, and +can be configured to target specific log locations on the host. The Fluentbit_ +configuration schema can be found on the official Fluentbit website. + +.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html + +Fluentbit provides a set of plug-ins for ingesting and filtering various log +types. These plug-ins include: + +- Tail: Tails a defined file for logged events +- Kube: Adds Kubernetes metadata to a logged event +- Systemd: Provides ability to collect logs from the journald daemon +- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP) + +The complete list of plugins can be found in the configuration_ section of the +Fluentbit documentation. + +.. _configuration: http://fluentbit.io/documentation/current/configuration/ + +Fluentbit uses parsers to turn unstructured log entries into structured entries +to make processing and filtering events easier. The two formats supported are +JSON maps and regular expressions. More information about Fluentbit's parsing +abilities can be found in the parsers_ section of Fluentbit's documentation. + +.. _parsers: http://fluentbit.io/documentation/current/parser/ + +Fluentbit's service and parser configurations are defined via the values.yaml +file, which allows for custom definitions of inputs, filters and outputs for +your logging needs. 
+Fluentbit's configuration can be found under the following key: + +:: + + conf: + fluentbit: + - service: + header: service + Flush: 1 + Daemon: Off + Log_Level: info + Parsers_File: parsers.conf + - containers_tail: + header: input + Name: tail + Tag: kube.* + Path: /var/log/containers/*.log + Parser: docker + DB: /var/log/flb_kube.db + Mem_Buf_Limit: 5MB + - kube_filter: + header: filter + Name: kubernetes + Match: kube.* + Merge_JSON_Log: On + - fluentd_output: + header: output + Name: forward + Match: "*" + Host: ${FLUENTD_HOST} + Port: ${FLUENTD_PORT} + +Fluentbit is configured by default to capture logs at the info log level. To +change this, override the Log_Level key with the appropriate levels, which are +documented in Fluentbit's configuration_. + +Fluentbit's parser configuration can be found under the following key: + +:: + + conf: + parsers: + - docker: + header: parser + Name: docker + Format: json + Time_Key: time + Time_Format: "%Y-%m-%dT%H:%M:%S.%L" + Time_Keep: On + +The values for the fluentbit and parsers keys are consumed by a fluent-logging +helper template that produces the appropriate configurations for the relevant +sections. Each list item (keys prefixed with a '-') represents a section in the +configuration files, and the arbitrary name of the list item should represent a +logical description of the section defined. The header key represents the type +of definition (filter, input, output, service or parser), and the remaining +entries will be rendered as space delimited configuration keys and values. 
For +example, the definitions above would result in the following: + +:: + + [SERVICE] + Daemon false + Flush 1 + Log_Level info + Parsers_File parsers.conf + [INPUT] + DB /var/log/flb_kube.db + Mem_Buf_Limit 5MB + Name tail + Parser docker + Path /var/log/containers/*.log + Tag kube.* + [FILTER] + Match kube.* + Merge_JSON_Log true + Name kubernetes + [OUTPUT] + Host ${FLUENTD_HOST} + Match * + Name forward + Port ${FLUENTD_PORT} + [PARSER] + Format json + Name docker + Time_Format %Y-%m-%dT%H:%M:%S.%L + Time_Keep true + Time_Key time + +Fluentd +------- + +Fluentd runs as a forwarding service that receives event entries from Fluentbit +and routes them to the appropriate destination. By default, Fluentd will route +all entries received from Fluentbit to Elasticsearch for indexing. The +Fluentd_ configuration schema can be found at the official Fluentd website. + +.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file + +Fluentd's configuration is handled in the values.yaml file in fluent-logging. +Similar to Fluentbit, configuration overrides provide flexibility in defining +custom routes for tagged log events. The configuration can be found under the +following key: + +:: + + conf: + fluentd: + - fluentbit_forward: + header: source + type: forward + port: "#{ENV['FLUENTD_PORT']}" + bind: 0.0.0.0 + - elasticsearch: + header: match + type: elasticsearch + expression: "**" + include_tag_key: true + host: "#{ENV['ELASTICSEARCH_HOST']}" + port: "#{ENV['ELASTICSEARCH_PORT']}" + logstash_format: true + buffer_chunk_limit: 10M + buffer_queue_limit: 32 + flush_interval: "20" + max_retry_wait: 300 + disable_retry_limit: "" + +The values for the fluentd keys are consumed by a fluent-logging helper template +that produces appropriate configurations for each directive desired. The list +items (keys prefixed with a '-') represent sections in the configuration file, +and the name of each list item should represent a logical description of the +section defined. 
The header key represents the type of definition (name of the +fluentd plug-in used), and the expression key is used when the plug-in requires +a pattern to match against (example: matches on certain input patterns). The +remaining entries will be rendered as space delimited configuration keys and +values. For example, the definition above would result in the following: + +:: + + <source> + bind 0.0.0.0 + port "#{ENV['FLUENTD_PORT']}" + @type forward + </source> + <match **> + buffer_chunk_limit 10M + buffer_queue_limit 32 + disable_retry_limit + flush_interval 20s + host "#{ENV['ELASTICSEARCH_HOST']}" + include_tag_key true + logstash_format true + max_retry_wait 300 + port "#{ENV['ELASTICSEARCH_PORT']}" + @type elasticsearch + </match> + +Some fluentd plug-ins require nested definitions. The fluentd helper template +can handle these definitions with the following structure: + +:: + + conf: + td_agent: + - fluentbit_forward: + header: source + type: forward + port: "#{ENV['FLUENTD_PORT']}" + bind: 0.0.0.0 + - log_transformer: + header: filter + type: record_transformer + expression: "foo.bar" + inner_def: + - record_transformer: + header: record + hostname: my_host + tag: my_tag + +In this example, the inner_def list will generate a nested configuration +entry in the log_transformer section. The nested definitions are handled by +supplying a list as the value for an arbitrary key, and the list value will +indicate the entry should be handled as a nested definition. The helper +template will render the above example key/value pairs as the following: + +:: + + <source> + bind 0.0.0.0 + port "#{ENV['FLUENTD_PORT']}" + @type forward + </source> + <filter foo.bar> + <record> + hostname my_host + tag my_tag + </record> + @type record_transformer + </filter> + +Fluentd Exporter +---------------------- + +The fluent-logging chart contains templates for an exporter to provide metrics +for Fluentd. These metrics provide insight into Fluentd's performance. 
Please +note monitoring for Fluentd is disabled by default, and must be enabled with the +following override: + +:: + + monitoring: + prometheus: + enabled: true + + +The Fluentd exporter uses the same service annotations as the other exporters, +and no additional configuration is required for Prometheus to target the +Fluentd exporter for scraping. The Fluentd exporter is configured with command +line flags, and the flags' default values can be found under the following key +in the values.yaml file: + +:: + + conf: + fluentd_exporter: + log: + format: "logger:stdout?json=true" + level: "info" + +The configuration keys configure the following behaviors: + +- log.format: Define the logger used and format of the output +- log.level: Log level for the exporter to use + +More information about the Fluentd exporter can be found on the exporter's +GitHub_ page. + +.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter diff --git a/doc/source/logging/index.rst b/doc/source/logging/index.rst new file mode 100644 index 000000000..176293e0c --- /dev/null +++ b/doc/source/logging/index.rst @@ -0,0 +1,11 @@ +OpenStack-Helm Logging +====================== + +Contents: + +.. toctree:: + :maxdepth: 2 + + elasticsearch + fluent-logging + kibana diff --git a/doc/source/logging/kibana.rst b/doc/source/logging/kibana.rst new file mode 100644 index 000000000..141d80dae --- /dev/null +++ b/doc/source/logging/kibana.rst @@ -0,0 +1,76 @@ +Kibana +====== + +The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed +into Elasticsearch. These visualizations provide the means to view logs captured +from services deployed in cluster and targeted for collection by Fluentbit. + +Authentication +-------------- + +The Kibana deployment includes a sidecar container that runs an Apache reverse +proxy to add authentication capabilities for Kibana. The username and password +are configured under the Kibana entry in the endpoints section of the chart's +values.yaml. 
+ +The configuration for Apache can be found under the conf.httpd key, and uses a +helm-toolkit function that allows for including gotpl entries in the template +directly. This allows the use of other templates, like the endpoint lookup +function templates, directly in the configuration for Apache. + +Configuration +------------- + +Kibana's configuration is driven by the chart's values.yaml file. The configuration +options are found under the following keys: + +:: + + conf: + elasticsearch: + pingTimeout: 1500 + preserveHost: true + requestTimeout: 30000 + shardTimeout: 0 + startupTimeout: 5000 + i18n: + defaultLocale: en + kibana: + defaultAppId: discover + index: .kibana + logging: + quiet: false + silent: false + verbose: false + ops: + interval: 5000 + server: + host: localhost + maxPayloadBytes: 1048576 + port: 5601 + ssl: + enabled: false + +The case of the sub-keys is important as these values are injected into +Kibana's configuration configmap with the toYaml function. More information on +the configuration options and available settings can be found in the official +Kibana documentation_. + +.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html + +Installation +------------ + +.. code-block:: bash + + helm install --namespace= local/kibana --name=kibana + +Setting Time Field +------------------ + +For Kibana to successfully read the logs from Elasticsearch's indexes, the time +field will need to be manually set after Kibana has successfully deployed. Upon +visiting the Kibana dashboard for the first time, a prompt will appear to choose the +time field with a drop-down menu. The default time field for Elasticsearch indexes +is '@timestamp'. 
Once this field is selected, the default view for querying log entries +can be found by selecting the "Discover" diff --git a/doc/source/monitoring/grafana.rst b/doc/source/monitoring/grafana.rst new file mode 100644 index 000000000..61d1f0a72 --- /dev/null +++ b/doc/source/monitoring/grafana.rst @@ -0,0 +1,89 @@ +Grafana +======= + +The Grafana chart in OpenStack-Helm Infra provides default dashboards for the +metrics gathered with Prometheus. The default dashboards include visualizations +for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and +OpenStack. + +Configuration +------------- + +Grafana +~~~~~~~ + +Grafana's configuration is driven by the chart's values.yaml file, and the +relevant configuration entries are under the following key: + +:: + + conf: + grafana: + paths: + server: + database: + session: + security: + users: + log: + log.console: + dashboards.json: + grafana_net: + +These keys correspond to sections in the grafana.ini configuration file, and the +to_ini helm-toolkit function will render these values into the appropriate +format in grafana.ini. The list of options for these keys can be found in the +official Grafana configuration_ documentation. + +.. _configuration: http://docs.grafana.org/installation/configuration/ + +Prometheus Data Source +~~~~~~~~~~~~~~~~~~~~~~ + +Grafana requires configured data sources for gathering metrics for display in +its dashboards. The configuration options for data sources are found under the +following key in Grafana's values.yaml file: + +:: + + conf: + provisioning: + datasources: + monitoring: + name: prometheus + type: prometheus + access: proxy + orgId: 1 + editable: true + basicAuth: true + +The Grafana chart will use the keys under each entry beneath +.conf.provisioning.datasources as inputs to a helper template that will render +the appropriate configuration for the data source. 
The key for each data source +(monitoring in the above example) should map to an entry in the endpoints +section in the chart's values.yaml, as the data source's URL and authentication +credentials will be populated by the values defined in that endpoint. + +.. _sources: http://docs.grafana.org/features/datasources/ + +Dashboards +~~~~~~~~~~ + +Grafana adds dashboards during installation from YAML definitions under +the following key: + +:: + + conf: + dashboards: + + +These YAML definitions are transformed to JSON, added to Grafana's +configuration configmap, and mounted to the Grafana pods dynamically, allowing for +flexibility in defining and adding custom dashboards to Grafana. Dashboards can +be added by inserting a new key along with a YAML dashboard definition as the +value. Additional dashboards can be found by searching on Grafana's dashboards_ +page or you can define your own. A JSON-to-YAML tool, such as json2yaml_, will +help transform any custom or new dashboards from JSON to YAML. + +.. _json2yaml: https://www.json2yaml.com/ diff --git a/doc/source/monitoring/index.rst b/doc/source/monitoring/index.rst new file mode 100644 index 000000000..aa87e305c --- /dev/null +++ b/doc/source/monitoring/index.rst @@ -0,0 +1,11 @@ +OpenStack-Helm Monitoring +========================= + +Contents: + +.. toctree:: + :maxdepth: 2 + + grafana + prometheus + nagios diff --git a/doc/source/monitoring/nagios.rst b/doc/source/monitoring/nagios.rst new file mode 100644 index 000000000..af970cf6b --- /dev/null +++ b/doc/source/monitoring/nagios.rst @@ -0,0 +1,365 @@ +Nagios +====== + +The Nagios chart in openstack-helm-infra can be used to provide an alarming +service that's tightly coupled to an OpenStack-Helm deployment. 
The Nagios +chart uses a custom Nagios core image that includes plugins developed to query +Prometheus directly for scraped metrics and triggered alarms, query the Ceph +manager endpoints directly to determine the health of a Ceph cluster, and to +query Elasticsearch for logged events that meet certain criteria (experimental). + +Authentication +-------------- + +The Nagios deployment includes a sidecar container that runs an Apache reverse +proxy to add authentication capabilities for Nagios. The username and password +are configured under the nagios entry in the endpoints section of the chart's +values.yaml. + +The configuration for Apache can be found under the conf.httpd key, and uses a +helm-toolkit function that allows for including gotpl entries in the template +directly. This allows the use of other templates, like the endpoint lookup +function templates, directly in the configuration for Apache. + +Image Plugins +------------- + +The Nagios image used contains custom plugins that can be used for the defined +service check commands. These plugins include: + +- check_prometheus_metric.py: Query Prometheus for a specific metric and value +- check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter +- check_rest_get_api.py: Check REST API status +- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config +- query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric + +More information about the Nagios image and plugins can be found here_. + +.. 
_here: https://github.com/att-comdev/nagios + + +Nagios Service Configuration +---------------------------- + +The Nagios service is configured via the following section in the chart's +values file: + +:: + + conf: + nagios: + nagios: + log_file: /opt/nagios/var/log/nagios.log + cfg_file: + - /opt/nagios/etc/nagios_objects.cfg + - /opt/nagios/etc/objects/commands.cfg + - /opt/nagios/etc/objects/contacts.cfg + - /opt/nagios/etc/objects/timeperiods.cfg + - /opt/nagios/etc/objects/templates.cfg + - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg + object_cache_file: /opt/nagios/var/objects.cache + precached_object_file: /opt/nagios/var/objects.precache + resource_file: /opt/nagios/etc/resource.cfg + status_file: /opt/nagios/var/status.dat + status_update_interval: 10 + nagios_user: nagios + nagios_group: nagios + check_external_commands: 1 + command_file: /opt/nagios/var/rw/nagios.cmd + lock_file: /var/run/nagios.lock + temp_file: /opt/nagios/var/nagios.tmp + temp_path: /tmp + event_broker_options: -1 + log_rotation_method: d + log_archive_path: /opt/nagios/var/log/archives + use_syslog: 1 + log_service_retries: 1 + log_host_retries: 1 + log_event_handlers: 1 + log_initial_states: 0 + log_current_states: 1 + log_external_commands: 1 + log_passive_checks: 1 + service_inter_check_delay_method: s + max_service_check_spread: 30 + service_interleave_factor: s + host_inter_check_delay_method: s + max_host_check_spread: 30 + max_concurrent_checks: 60 + check_result_reaper_frequency: 10 + max_check_result_reaper_time: 30 + check_result_path: /opt/nagios/var/spool/checkresults + max_check_result_file_age: 3600 + cached_host_check_horizon: 15 + cached_service_check_horizon: 15 + enable_predictive_host_dependency_checks: 1 + enable_predictive_service_dependency_checks: 1 + soft_state_dependencies: 0 + auto_reschedule_checks: 0 + auto_rescheduling_interval: 30 + auto_rescheduling_window: 180 + service_check_timeout: 60 + host_check_timeout: 60 + event_handler_timeout: 60 
+ notification_timeout: 60 + ocsp_timeout: 5 + perfdata_timeout: 5 + retain_state_information: 1 + state_retention_file: /opt/nagios/var/retention.dat + retention_update_interval: 60 + use_retained_program_state: 1 + use_retained_scheduling_info: 1 + retained_host_attribute_mask: 0 + retained_service_attribute_mask: 0 + retained_process_host_attribute_mask: 0 + retained_process_service_attribute_mask: 0 + retained_contact_host_attribute_mask: 0 + retained_contact_service_attribute_mask: 0 + interval_length: 1 + check_workers: 4 + check_for_updates: 1 + bare_update_check: 0 + use_aggressive_host_checking: 0 + execute_service_checks: 1 + accept_passive_service_checks: 1 + execute_host_checks: 1 + accept_passive_host_checks: 1 + enable_notifications: 1 + enable_event_handlers: 1 + process_performance_data: 0 + obsess_over_services: 0 + obsess_over_hosts: 0 + translate_passive_host_checks: 0 + passive_host_checks_are_soft: 0 + check_for_orphaned_services: 1 + check_for_orphaned_hosts: 1 + check_service_freshness: 1 + service_freshness_check_interval: 60 + check_host_freshness: 0 + host_freshness_check_interval: 60 + additional_freshness_latency: 15 + enable_flap_detection: 1 + low_service_flap_threshold: 5.0 + high_service_flap_threshold: 20.0 + low_host_flap_threshold: 5.0 + high_host_flap_threshold: 20.0 + date_format: us + use_regexp_matching: 1 + use_true_regexp_matching: 0 + daemon_dumps_core: 0 + use_large_installation_tweaks: 0 + enable_environment_macros: 0 + debug_level: 0 + debug_verbosity: 1 + debug_file: /opt/nagios/var/nagios.debug + max_debug_file_size: 1000000 + allow_empty_hostgroup_assignment: 1 + illegal_macro_output_chars: "`~$&|'<>\"" + +Nagios CGI Configuration +------------------------ + +The Nagios CGI configuration is defined via the following section in the chart's +values file: + +:: + + conf: + nagios: + cgi: + main_config_file: /opt/nagios/etc/nagios.cfg + physical_html_path: /opt/nagios/share + url_html_path: /nagios + show_context_help: 0 
+ use_pending_states: 1 + use_authentication: 0 + use_ssl_authentication: 0 + authorized_for_system_information: "*" + authorized_for_configuration_information: "*" + authorized_for_system_commands: nagiosadmin + authorized_for_all_services: "*" + authorized_for_all_hosts: "*" + authorized_for_all_service_commands: "*" + authorized_for_all_host_commands: "*" + default_statuswrl_layout: 4 + ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$ + refresh_rate: 90 + result_limit: 100 + escape_html_tags: 1 + action_url_target: _blank + notes_url_target: _blank + lock_author_names: 1 + navbar_search_for_addresses: 1 + navbar_search_for_aliases: 1 + +Nagios Host Configuration +------------------------- + +The Nagios chart includes a single host definition for the Prometheus instance +queried for metrics. The host definition can be found under the following +values key: + +:: + + conf: + nagios: + hosts: + - prometheus: + use: linux-server + host_name: prometheus + alias: "Prometheus Monitoring" + address: 127.0.0.1 + hostgroups: prometheus-hosts + check_command: check-prometheus-host-alive + +The address for the Prometheus host is defined by the PROMETHEUS_SERVICE +environment variable in the deployment template, which is determined by the +monitoring entry in the Nagios chart's endpoints section. The endpoint is then +available as a macro for Nagios to use in all Prometheus based queries. For +example: + +:: + + - check_prometheus_host_alive: + command_name: check-prometheus-host-alive + command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10" + +The $USER2$ macro above corresponds to the Prometheus endpoint defined in the +PROMETHEUS_SERVICE environment variable. All checks that use the +prometheus-hosts hostgroup will map back to the Prometheus host defined by this +endpoint. 
+ +Nagios HostGroup Configuration +------------------------------ + +The Nagios chart includes configuration values for defined host groups under the +following values key: + +:: + + conf: + nagios: + host_groups: + - prometheus-hosts: + hostgroup_name: prometheus-hosts + alias: "Prometheus Virtual Host" + - base-os: + hostgroup_name: base-os + alias: "base-os" + +These hostgroups are used to define which group of hosts should be targeted by +a particular nagios check. An example of a check that targets Prometheus for a +specific metric query would be: + +:: + + - check_ceph_monitor_quorum: + use: notifying_service + hostgroup_name: prometheus-hosts + service_description: "CEPH_quorum" + check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists + check_interval: 60 + +An example of a check that targets all hosts for a base-os type check (memory +usage, latency, etc) would be: + +:: + + - check_memory_usage: + use: notifying_service + service_description: Memory_usage + check_command: check_memory_usage + hostgroup_name: base-os + +These two host groups allow for a wide range of targeted checks for determining +the status of all components of an OpenStack-Helm deployment. + +Nagios Command Configuration +---------------------------- + +The Nagios chart includes configuration values for the command definitions Nagios +will use when executing service checks. 
These values are found under the +following key: + +:: + + conf: + nagios: + commands: + - send_service_snmp_trap: + command_name: send_service_snmp_trap + command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'" + - send_host_snmp_trap: + command_name: send_host_snmp_trap + command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'" + - send_service_http_post: + command_name: send_service_http_post + command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'" + - send_host_http_post: + command_name: send_host_http_post + command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'" + - check_prometheus_host_alive: + command_name: check-prometheus-host-alive + command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10" + +The list of defined commands can be modified with configuration overrides, which +makes it possible to define commands specific to an infrastructure deployment. +These commands can include querying Prometheus for metrics on dependencies for a +service to determine whether an alert should be raised, executing checks on each +host to determine network latency or file system usage, or checking each node +for issues with NTP clock skew. + +Note: Since the conf.nagios.commands key contains a list of the defined commands, +the entire contents of conf.nagios.commands will need to be overridden if +additional commands are desired (Helm overrides replace a list as a whole rather +than merging individual entries).
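The note about overriding the whole list can be illustrated with a small merge routine. This is a simplified sketch of the merge behavior (maps are merged key by key, while lists are replaced wholesale), not Helm's actual implementation. The merge_values function and the check_ntp_skew command are hypothetical names used only for illustration.

```python
import copy


def merge_values(base, override):
    """Merge override into base: maps merge key by key, anything else replaces."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            # Lists (and scalars) are not merged; the override wins outright.
            result[key] = copy.deepcopy(value)
    return result


# Trimmed-down stand-in for the chart defaults.
defaults = {"conf": {"nagios": {"commands": [
    {"check_prometheus_host_alive": {"command_name": "check-prometheus-host-alive"}},
]}}}

# Hypothetical override that lists only the new command.
override = {"conf": {"nagios": {"commands": [
    {"check_ntp_skew": {"command_name": "check_ntp_skew"}},
]}}}

merged = merge_values(defaults, override)
command_keys = [next(iter(entry)) for entry in merged["conf"]["nagios"]["commands"]]
print(command_keys)  # prints ['check_ntp_skew']
```

The default command is gone from the merged result, which is why an overrides file that adds a command should repeat every existing entry it still needs.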
+ +Nagios Service Check Configuration +---------------------------------- + +The Nagios chart includes configuration values for the service checks Nagios +will execute. These service check commands can be found under the following +key: + +:: + conf: + nagios: + services: + - notifying_service: + name: notifying_service + use: generic-service + flap_detection_enabled: 0 + process_perf_data: 0 + contact_groups: snmp_and_http_notifying_contact_group + check_interval: 60 + notification_interval: 120 + retry_interval: 30 + register: 0 + - check_ceph_health: + use: notifying_service + hostgroup_name: base-os + service_description: "CEPH_health" + check_command: check_ceph_health + check_interval: 300 + - check_hosts_health: + use: generic-service + hostgroup_name: prometheus-hosts + service_description: "Nodes_health" + check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready. + check_interval: 60 + - check_prometheus_replicas: + use: notifying_service + hostgroup_name: prometheus-hosts + service_description: "Prometheus_replica-count" + check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas + check_interval: 60 + +The Nagios service configurations define the checks Nagios will perform. These +checks contain keys for defining: the service type to use, the host group to +target, the description of the service check, the command the check should use, +and the interval at which to trigger the service check. These services can also +be extended to provide additional insight into the overall status of a +particular service. They also make it possible to define advanced +checks for determining the overall health and liveness of a service. For +example, a service check could trigger an alarm for the OpenStack services when +Nagios detects that the relevant database and message queue have become +unresponsive.
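The check_command values above use the standard Nagios convention of separating the command name from its arguments with "!" characters, with the arguments made available to the command definition as the $ARG1$, $ARG2$, ... macros. The sketch below shows that split for one of the checks listed above. The split_check_command helper is hypothetical, it ignores escaped "!" characters, and the way check_prom_alert consumes its arguments is not shown in this document.

```python
def split_check_command(check_command):
    """Split a Nagios check_command into its name and $ARGn$ macro values."""
    name, *arguments = check_command.split("!")
    macros = {"$ARG%d$" % index: value
              for index, value in enumerate(arguments, start=1)}
    return name, macros


name, macros = split_check_command(
    "check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready."
)
print(name)
for macro, value in macros.items():
    print(macro, "=", value)
```

For this check, the first argument carries the alert name and the second carries the message to report, each exposed to the command line through its own macro.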
diff --git a/doc/source/monitoring/prometheus.rst b/doc/source/monitoring/prometheus.rst new file mode 100644 index 000000000..446589ee4 --- /dev/null +++ b/doc/source/monitoring/prometheus.rst @@ -0,0 +1,338 @@ +Prometheus +========== + +The Prometheus chart in openstack-helm-infra provides a time series database and +a strong querying language for monitoring various components of OpenStack-Helm. +Prometheus gathers metrics by scraping defined service endpoints or pods at +specified intervals and indexing them in the underlying time series database. + +Authentication +-------------- + +The Prometheus deployment includes a sidecar container that runs an Apache +reverse proxy to add authentication capabilities for Prometheus. The +username and password are configured under the monitoring entry in the endpoints +section of the chart's values.yaml. + +The configuration for Apache can be found under the conf.httpd key, and uses a +helm-toolkit function that allows for including gotpl entries in the template +directly. This allows the use of other templates, like the endpoint lookup +function templates, directly in the configuration for Apache. + +Prometheus Service configuration +-------------------------------- + +The Prometheus service is configured via command line flags set during runtime. +These flags include: setting the configuration file, setting log levels, setting +characteristics of the time series database, and enabling the web admin API for +snapshot support. These settings can be configured via the values tree at: + +:: + + conf: + prometheus: + command_line_flags: + log.level: info + query.max_concurrency: 20 + query.timeout: 2m + storage.tsdb.path: /var/lib/prometheus/data + storage.tsdb.retention: 7d + web.enable_admin_api: false + web.enable_lifecycle: false + +The Prometheus configuration file contains the definitions for scrape targets +and the location of the rules files for triggering alerts on scraped metrics. 
+The configuration file is defined in the values file, and can be found at: + +:: + + conf: + prometheus: + scrape_configs: | + +By defining the configuration via the values file, an operator can override all +configuration components of the Prometheus deployment at runtime. + +Kubernetes Endpoint Configuration +--------------------------------- + +The Prometheus chart in openstack-helm-infra uses the built-in service discovery +mechanisms for Kubernetes endpoints and pods to automatically configure scrape +targets. Functions added to helm-toolkit allow configuration of these targets +via annotations that can be applied to any service or pod that exposes metrics +for Prometheus, whether a service for an application-specific exporter or an +application that provides a metrics endpoint via its service. The values in +these functions correspond to entries in the monitoring tree under the +prometheus key in a chart's values.yaml file. + + +The function definitions are below: + +:: + + {{- define "helm-toolkit.snippets.prometheus_service_annotations" -}} + {{- $config := index . 0 -}} + {{- if $config.scrape }} + prometheus.io/scrape: {{ $config.scrape | quote }} + {{- end }} + {{- if $config.scheme }} + prometheus.io/scheme: {{ $config.scheme | quote }} + {{- end }} + {{- if $config.path }} + prometheus.io/path: {{ $config.path | quote }} + {{- end }} + {{- if $config.port }} + prometheus.io/port: {{ $config.port | quote }} + {{- end }} + {{- end -}} + +:: + + {{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}} + {{- $config := index .
0 -}} + {{- if $config.scrape }} + prometheus.io/scrape: {{ $config.scrape | quote }} + {{- end }} + {{- if $config.path }} + prometheus.io/path: {{ $config.path | quote }} + {{- end }} + {{- if $config.port }} + prometheus.io/port: {{ $config.port | quote }} + {{- end }} + {{- end -}} + +These functions render the following annotations: + +- prometheus.io/scrape: Must be set to true for Prometheus to scrape target +- prometheus.io/scheme: Overrides scheme used to scrape target if not http +- prometheus.io/path: Overrides path used to scrape target metrics if not /metrics +- prometheus.io/port: Overrides port to scrape metrics on if not service's default port + +Each chart that can be targeted for monitoring by Prometheus has a prometheus +section under a monitoring tree in the chart's values.yaml, and Prometheus +monitoring is disabled by default for those services. Example values for the +required entries can be found in the following monitoring configuration for the +prometheus-node-exporter chart: + +:: + + monitoring: + prometheus: + enabled: false + node_exporter: + scrape: true + +If the prometheus.enabled key is set to true, the annotations are set on the +targeted service or pod as the condition for applying the annotations evaluates +to true. For example: + +:: + + {{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }} + --- + apiVersion: v1 + kind: Service + metadata: + name: {{ tuple "node_metrics" "internal" . 
| include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }} + labels: + {{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }} + annotations: + {{- if .Values.monitoring.prometheus.enabled }} + {{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }} + {{- end }} + +Kubelet, API Server, and cAdvisor +--------------------------------- + +The Prometheus chart includes scrape target configurations for the kubelet, the +Kubernetes API servers, and cAdvisor. These targets are configured based on +a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy +Kubernetes in the gates. These configurations may need to change based on your +chosen method of deployment. Please note the cAdvisor metrics will not be +captured if the kubelet was started with the following flag: + +:: + + --cadvisor-port=0 + +To enable the gathering of the kubelet's custom metrics, the following flag must +be set: + +:: + + --enable-custom-metrics + +Installation +------------ + +The Prometheus chart can be installed with the following command: + +.. code-block:: bash + + helm install --namespace=openstack local/prometheus --name=prometheus + +The above command results in a Prometheus deployment configured to automatically +discover services with the necessary annotations for scraping, configured to +gather metrics on the kubelet, the Kubernetes API servers, and cAdvisor. + +Extending Prometheus +-------------------- + +Prometheus can target various exporters to gather metrics related to specific +applications to extend visibility into an OpenStack-Helm deployment. 
Currently, +openstack-helm-infra contains charts for: + +- prometheus-kube-state-metrics: Provides additional Kubernetes metrics +- prometheus-node-exporter: Provides metrics for nodes and Linux kernels +- prometheus-openstack-exporter: Provides metrics for OpenStack services + +Kube-State-Metrics +~~~~~~~~~~~~~~~~~~ + +The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects +as well as metrics for kube-scheduler and kube-controller-manager. Information +on the specific metrics available via the kube-state-metrics service can be +found in the kube-state-metrics_ documentation. + +The prometheus-kube-state-metrics chart can be installed with the following: + +.. code-block:: bash + + helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics + +.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation + +Node Exporter +~~~~~~~~~~~~~ + +The prometheus-node-exporter chart provides hardware and operating system metrics +exposed via Linux kernels. Information on the specific metrics available via +the Node exporter can be found on the Node_exporter_ GitHub page. + +The prometheus-node-exporter chart can be installed with the following: + +.. code-block:: bash + + helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter + +.. _Node_exporter: https://github.com/prometheus/node_exporter + +OpenStack Exporter +~~~~~~~~~~~~~~~~~~ + +The prometheus-openstack-exporter chart provides metrics specific to the +OpenStack services. The exporter's source code can be found here_. While the +metrics provided are by no means comprehensive, they will be expanded upon. + +Please note the OpenStack exporter requires the creation of a Keystone user to +successfully gather metrics. To create the required user, the chart uses the +same Keystone user management job the OpenStack service charts use.
+ +The prometheus-openstack-exporter chart can be installed with the following: + +.. code-block:: bash + + helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter + +.. _here: https://github.com/att-comdev/openstack-metrics-collector + +Other Exporters +~~~~~~~~~~~~~~~ + +Certain charts in OpenStack-Helm include templates for application-specific +Prometheus exporters, which keeps the monitoring of those services tightly coupled +to the chart. The templates for these exporters can be found in the monitoring +subdirectory in the chart. These exporters are disabled by default, and can be +enabled by setting the appropriate flag in the monitoring.prometheus key of the +chart's values.yaml file. The charts containing exporters include: + +- Elasticsearch_ +- RabbitMQ_ +- MariaDB_ +- Memcached_ +- Fluentd_ +- Postgres_ + +.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter +.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter +.. _MariaDB: https://github.com/prometheus/mysqld_exporter +.. _Memcached: https://github.com/prometheus/memcached_exporter +.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter +.. _Postgres: https://github.com/wrouesnel/postgres_exporter + +Ceph +~~~~ + +Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus module. +This module can be enabled in Ceph's values.yaml under the ceph_mgr_enabled_plugins +key by appending prometheus to the list of enabled modules. After enabling the +prometheus module, metrics can be scraped on the ceph-mgr service endpoint. This +relies on the Prometheus annotations attached to the ceph-mgr service template, and +these annotations can be modified in the endpoints section of Ceph's values.yaml +file. Information on the specific metrics available via the prometheus module +can be found in the Ceph prometheus_ module documentation. + +.. 
_prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/ + + +Prometheus Dashboard +-------------------- + +Prometheus includes a dashboard that can be accessed via the exposed +Prometheus endpoint (NodePort or otherwise). This dashboard will give you a +view of your scrape targets' state, the configuration values for Prometheus's +scrape jobs and command line flags, a view of any alerts triggered based on the +defined rules, and a means for using PromQL to query scraped metrics. The +Prometheus dashboard is a useful tool for verifying Prometheus is configured +appropriately and for verifying the status of any services targeted for scraping via +the Prometheus service discovery annotations. + +Rules Configuration +------------------- + +Prometheus provides a querying language that can operate on defined rules which +allow for the generation of alerts on specific metrics. The Prometheus chart in +openstack-helm-infra defines these rules via the values.yaml file. Defining +them in the values file gives operators the flexibility to provide specific +rules via overrides at installation. The following rules keys are provided: + +:: + + values: + conf: + rules: + alertmanager: + etcd3: + kube_apiserver: + kube_controller_manager: + kubelet: + kubernetes: + rabbitmq: + mysql: + ceph: + openstack: + custom: + +These keys provide recording and alert rules for all infrastructure +components of an OpenStack-Helm deployment. If you wish to exclude rules for a +component, leave that component's tree empty in an overrides file. To read more +about Prometheus recording and alert rules definitions, please see the official +Prometheus recording_ and alert_ rules documentation. + +.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ +.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ + +Note: Prometheus releases prior to 2.0 used gotpl to define rules.
Prometheus +2.0 changed the rules format to YAML, making them much easier to read. The +Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take +advantage of changes to the underlying storage layer and the handling of stale +data. The chart will not support overrides for Prometheus versions below 2.0, +as the command line flags for the service changed between versions. + +The wide range of exporters included in OpenStack-Helm coupled with the ability +to define rules with configuration overrides allows for the addition of custom +alerting and recording rules to fit an operator's monitoring needs. Adding new +rules or modifying existing rules requires overrides for either an existing key +under conf.rules or the addition of a new key under conf.rules. The addition +of custom rules can be used to define complex checks that can be extended for +determining the liveness or health of infrastructure components. diff --git a/doc/source/readme.rst b/doc/source/readme.rst new file mode 100644 index 000000000..a6210d3d8 --- /dev/null +++ b/doc/source/readme.rst @@ -0,0 +1 @@ +.. include:: ../../README.rst