Merge "Foundation for LMA docs"

@@ -8,7 +8,9 @@ Contents:

    install/index
    testing/index
+   monitoring/index
+   logging/index
    readme

 Indices and Tables
 ==================


doc/source/logging/elasticsearch.rst (new file, 196 lines)
@@ -0,0 +1,196 @@

Elasticsearch
=============

The Elasticsearch chart in openstack-helm-infra provides a distributed data
store to index and analyze logs generated from the OpenStack-Helm services.
The chart contains templates for:

- Elasticsearch client nodes
- Elasticsearch data nodes
- Elasticsearch master nodes
- An Elasticsearch exporter for providing cluster metrics to Prometheus
- A cronjob for Elastic Curator to manage data indices

Authentication
--------------

The Elasticsearch deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Elasticsearch. The
username and password are configured under the Elasticsearch entry in the
endpoints section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

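
For example, the default credentials can be overridden at install time with a
values snippet similar to the sketch below. This assumes the conventional
OpenStack-Helm endpoints auth layout; verify the exact key names against the
chart's values.yaml:

::

    endpoints:
      elasticsearch:
        auth:
          admin:
            username: elasticsearch   # assumed key layout, not chart-verified
            password: changeme        # replace with a real secret
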

Elasticsearch Service Configuration
-----------------------------------

The Elasticsearch service configuration file can be modified with a combination
of pod environment variables and entries in the values.yaml file. Elasticsearch
does not require much configuration out of the box, and the default values for
these configuration settings are meant to provide a highly available cluster by
default.

The vital entries in this configuration file are:

- path.data: The path at which to store the indexed data
- path.repo: The location of any snapshot repositories to backup indexes
- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk
- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster

The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk
during execution, and setting this value to false will negatively affect the
health of your Elasticsearch nodes. The discovery.zen.minimum_master_nodes flag
registers the minimum number of masters required for your Elasticsearch cluster
to register as healthy and functional.

To read more about Elasticsearch's configuration file, please see the official
documentation_.

.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html

Elastic Curator
---------------

The Elasticsearch chart contains a cronjob to run Elastic Curator at specified
intervals to manage the lifecycle of your indices. Curator can:

- Take and send a snapshot of your indexes to a specified snapshot repository
- Delete indexes older than a specified length of time
- Restore indexes with previous index snapshots
- Reindex an index into a new or preexisting index

The full list of supported Curator actions can be found in the actions_ section
of the official Curator documentation. The list of options available for those
actions can be found in the options_ section of the Curator documentation.

.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html
.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html

Curator's configuration is handled via entries in Elasticsearch's values.yaml
file and must be overridden to achieve your index lifecycle management
needs. Please note that any unused field should be left blank, as an entry of
"None" will result in an exception, since Curator will read it as the literal
string "None" instead of a None value.

The section for Curator's service configuration can be found at:

::

    conf:
      curator:
        config:
          client:
            hosts:
              - elasticsearch-logging
            port: 9200
            url_prefix:
            use_ssl: False
            certificate:
            client_cert:
            client_key:
            ssl_no_validate: False
            http_auth:
            timeout: 30
            master_only: False
          logging:
            loglevel: INFO
            logfile:
            logformat: default
            blacklist: ['elasticsearch', 'urllib3']

Curator's actions are configured in the following section:

::

    conf:
      curator:
        action_file:
          actions:
            1:
              action: delete_indices
              description: "Clean up ES by deleting old indices"
              options:
                timeout_override:
                continue_if_exception: False
                ignore_empty_list: True
                disable_action: True
              filters:
              - filtertype: age
                source: name
                direction: older
                timestring: '%Y.%m.%d'
                unit: days
                unit_count: 30
                field:
                stats_result:
                epoch:
                exclude: False

The Elasticsearch chart contains commented example actions for deleting and
snapshotting indexes older than 30 days. Please note these actions are provided
as a reference and are disabled by default to avoid any unexpected behavior
against your indexes.

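
To actually enable the delete action shown above, an override can flip
disable_action to False and adjust the retention window. The following is a
sketch only; the description and unit_count values are illustrative:

::

    conf:
      curator:
        action_file:
          actions:
            1:
              action: delete_indices
              description: "Delete indices older than 14 days"
              options:
                timeout_override:
                continue_if_exception: False
                ignore_empty_list: True
                disable_action: False
              filters:
              - filtertype: age
                source: name
                direction: older
                timestring: '%Y.%m.%d'
                unit: days
                unit_count: 14
                field:
                stats_result:
                epoch:
                exclude: False
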

Elasticsearch Exporter
----------------------

The Elasticsearch chart contains templates for an exporter to provide metrics
for Prometheus. These metrics provide insight into the performance and overall
health of your Elasticsearch cluster. Please note monitoring for Elasticsearch
is disabled by default, and must be enabled with the following override:

::

    monitoring:
      prometheus:
        enabled: true

The Elasticsearch exporter uses the same service annotations as the other
exporters, and no additional configuration is required for Prometheus to target
the Elasticsearch exporter for scraping. The Elasticsearch exporter is
configured with command line flags, and the flags' default values can be found
under the following key in the values.yaml file:

::

    conf:
      prometheus_elasticsearch_exporter:
        es:
          all: true
          timeout: 20s

The configuration keys configure the following behaviors:

- es.all: Gather information from all nodes, not just the connecting node
- es.timeout: Timeout for metrics queries

More information about the Elasticsearch exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter

Snapshot Repositories
---------------------

Before Curator can store snapshots in a specified repository, Elasticsearch must
register the configured repository. To achieve this, the Elasticsearch chart
contains a job for registering an s3 snapshot repository backed by radosgateway.
This job is disabled by default as the curator actions for snapshots are
disabled by default. To enable the snapshot job, the
conf.elasticsearch.snapshots.enabled flag must be set to true. The following
configuration keys are relevant:

- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories
- conf.elasticsearch.snapshots.bucket: Name of the RGW s3 bucket to use
- conf.elasticsearch.snapshots.repositories: Name of repositories to create

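
As a sketch, enabling the snapshot job could look like the following; the bucket
name is a placeholder, and the expected structure of the repositories entry
should be confirmed against the chart's values.yaml:

::

    conf:
      elasticsearch:
        snapshots:
          enabled: true
          bucket: elasticsearch-snapshots   # placeholder RGW bucket name
          # repositories: <names of the snapshot repositories to create>
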

More information about Elasticsearch repositories can be found in the official
Elasticsearch snapshot_ documentation.

.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories


doc/source/logging/fluent-logging.rst (new file, 279 lines)
@@ -0,0 +1,279 @@

Fluent-logging
==============

The fluent-logging chart in openstack-helm-infra provides the base for a
centralized logging platform for OpenStack-Helm. The chart combines two
services, Fluentbit and Fluentd, to gather logs generated by the services,
filter on or add metadata to logged events, then forward them to Elasticsearch
for indexing.

Fluentbit
---------

Fluentbit runs as a log-collecting component on each host in the cluster, and
can be configured to target specific log locations on the host. The Fluentbit_
configuration schema can be found on the official Fluentbit website.

.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html

Fluentbit provides a set of plug-ins for ingesting and filtering various log
types. These plug-ins include:

- Tail: Tails a defined file for logged events
- Kube: Adds Kubernetes metadata to a logged event
- Systemd: Provides the ability to collect logs from the journald daemon
- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP)

The complete list of plugins can be found in the configuration_ section of the
Fluentbit documentation.

.. _configuration: http://fluentbit.io/documentation/current/configuration/

Fluentbit uses parsers to turn unstructured log entries into structured entries
to make processing and filtering events easier. The two formats supported are
JSON maps and regular expressions. More information about Fluentbit's parsing
abilities can be found in the parsers_ section of Fluentbit's documentation.

.. _parsers: http://fluentbit.io/documentation/current/parser/

Fluentbit's service and parser configurations are defined via the values.yaml
file, which allows for custom definitions of inputs, filters and outputs for
your logging needs. Fluentbit's configuration can be found under the following
key:

::

    conf:
      fluentbit:
        - service:
            header: service
            Flush: 1
            Daemon: Off
            Log_Level: info
            Parsers_File: parsers.conf
        - containers_tail:
            header: input
            Name: tail
            Tag: kube.*
            Path: /var/log/containers/*.log
            Parser: docker
            DB: /var/log/flb_kube.db
            Mem_Buf_Limit: 5MB
        - kube_filter:
            header: filter
            Name: kubernetes
            Match: kube.*
            Merge_JSON_Log: On
        - fluentd_output:
            header: output
            Name: forward
            Match: "*"
            Host: ${FLUENTD_HOST}
            Port: ${FLUENTD_PORT}

Fluentbit is configured by default to capture logs at the info log level. To
change this, override the Log_Level key with the appropriate level; the
available levels are documented in Fluentbit's configuration_.

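
For example, raising the capture level to debug could be done with an override
similar to the sketch below. Note that conf.fluentbit is a list, so a real
override must restate every section shown above, not just the service entry:

::

    conf:
      fluentbit:
        - service:
            header: service
            Flush: 1
            Daemon: Off
            Log_Level: debug
            Parsers_File: parsers.conf
        # ... the containers_tail, kube_filter and fluentd_output sections
        # shown above must be repeated here unchanged ...
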

Fluentbit's parser configuration can be found under the following key:

::

    conf:
      parsers:
        - docker:
            header: parser
            Name: docker
            Format: json
            Time_Key: time
            Time_Format: "%Y-%m-%dT%H:%M:%S.%L"
            Time_Keep: On

The values for the fluentbit and parsers keys are consumed by a fluent-logging
helper template that produces the appropriate configurations for the relevant
sections. Each list item (keys prefixed with a '-') represents a section in the
configuration files, and the arbitrary name of the list item should represent a
logical description of the section defined. The header key represents the type
of definition (filter, input, output, service or parser), and the remaining
entries will be rendered as space delimited configuration keys and values. For
example, the definitions above would result in the following:

::

    [SERVICE]
        Daemon false
        Flush 1
        Log_Level info
        Parsers_File parsers.conf
    [INPUT]
        DB /var/log/flb_kube.db
        Mem_Buf_Limit 5MB
        Name tail
        Parser docker
        Path /var/log/containers/*.log
        Tag kube.*
    [FILTER]
        Match kube.*
        Merge_JSON_Log true
        Name kubernetes
    [OUTPUT]
        Host ${FLUENTD_HOST}
        Match *
        Name forward
        Port ${FLUENTD_PORT}
    [PARSER]
        Format json
        Name docker
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep true
        Time_Key time

Fluentd
-------

Fluentd runs as a forwarding service that receives event entries from Fluentbit
and routes them to the appropriate destination. By default, Fluentd will route
all entries received from Fluentbit to Elasticsearch for indexing. The
Fluentd_ configuration schema can be found at the official Fluentd website.

.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file

Fluentd's configuration is handled in the values.yaml file in fluent-logging.
Similar to Fluentbit, configuration overrides provide flexibility in defining
custom routes for tagged log events. The configuration can be found under the
following key:

::

    conf:
      fluentd:
        - fluentbit_forward:
            header: source
            type: forward
            port: "#{ENV['FLUENTD_PORT']}"
            bind: 0.0.0.0
        - elasticsearch:
            header: match
            type: elasticsearch
            expression: "**"
            include_tag_key: true
            host: "#{ENV['ELASTICSEARCH_HOST']}"
            port: "#{ENV['ELASTICSEARCH_PORT']}"
            logstash_format: true
            buffer_chunk_limit: 10M
            buffer_queue_limit: 32
            flush_interval: "20"
            max_retry_wait: 300
            disable_retry_limit: ""

The values for the fluentd keys are consumed by a fluent-logging helper template
that produces appropriate configurations for each directive desired. The list
items (keys prefixed with a '-') represent sections in the configuration file,
and the name of each list item should represent a logical description of the
section defined. The header key represents the type of definition (name of the
fluentd plug-in used), and the expression key is used when the plug-in requires
a pattern to match against (example: matches on certain input patterns). The
remaining entries will be rendered as space delimited configuration keys and
values. For example, the definition above would result in the following:

::

    <source>
      bind 0.0.0.0
      port "#{ENV['FLUENTD_PORT']}"
      @type forward
    </source>
    <match **>
      buffer_chunk_limit 10M
      buffer_queue_limit 32
      disable_retry_limit
      flush_interval 20s
      host "#{ENV['ELASTICSEARCH_HOST']}"
      include_tag_key true
      logstash_format true
      max_retry_wait 300
      port "#{ENV['ELASTICSEARCH_PORT']}"
      @type elasticsearch
    </match>

Some fluentd plug-ins require nested definitions. The fluentd helper template
can handle these definitions with the following structure:

::

    conf:
      td_agent:
        - fluentbit_forward:
            header: source
            type: forward
            port: "#{ENV['FLUENTD_PORT']}"
            bind: 0.0.0.0
        - log_transformer:
            header: filter
            type: record_transformer
            expression: "foo.bar"
            inner_def:
              - record_transformer:
                  header: record
                  hostname: my_host
                  tag: my_tag

In this example, the inner_def list will generate a nested configuration
entry in the log_transformer section. The nested definitions are handled by
supplying a list as the value for an arbitrary key, and the list value will
indicate the entry should be handled as a nested definition. The helper
template will render the above example key/value pairs as the following:

::

    <source>
      bind 0.0.0.0
      port "#{ENV['FLUENTD_PORT']}"
      @type forward
    </source>
    <filter foo.bar>
      <record>
        hostname my_host
        tag my_tag
      </record>
      @type record_transformer
    </filter>

Fluentd Exporter
----------------

The fluent-logging chart contains templates for an exporter to provide metrics
for Fluentd. These metrics provide insight into Fluentd's performance. Please
note monitoring for Fluentd is disabled by default, and must be enabled with the
following override:

::

    monitoring:
      prometheus:
        enabled: true

The Fluentd exporter uses the same service annotations as the other exporters,
and no additional configuration is required for Prometheus to target the
Fluentd exporter for scraping. The Fluentd exporter is configured with command
line flags, and the flags' default values can be found under the following key
in the values.yaml file:

::

    conf:
      fluentd_exporter:
        log:
          format: "logger:stdout?json=true"
          level: "info"

The configuration keys configure the following behaviors:

- log.format: Define the logger used and format of the output
- log.level: Log level for the exporter to use

More information about the Fluentd exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter


doc/source/logging/index.rst (new file, 11 lines)
@@ -0,0 +1,11 @@

OpenStack-Helm Logging
======================

Contents:

.. toctree::
   :maxdepth: 2

   elasticsearch
   fluent-logging
   kibana


doc/source/logging/kibana.rst (new file, 76 lines)
@@ -0,0 +1,76 @@

Kibana
======

The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed
into Elasticsearch. These visualizations provide the means to view logs captured
from services deployed in the cluster and targeted for collection by Fluentbit.

Authentication
--------------

The Kibana deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Kibana. The username and password
are configured under the Kibana entry in the endpoints section of the chart's
values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Configuration
-------------

Kibana's configuration is driven by the chart's values.yaml file. The
configuration options are found under the following keys:

::

    conf:
      elasticsearch:
        pingTimeout: 1500
        preserveHost: true
        requestTimeout: 30000
        shardTimeout: 0
        startupTimeout: 5000
      il8n:
        defaultLocale: en
      kibana:
        defaultAppId: discover
        index: .kibana
      logging:
        quiet: false
        silent: false
        verbose: false
      ops:
        interval: 5000
      server:
        host: localhost
        maxPayloadBytes: 1048576
        port: 5601
        ssl:
          enabled: false

The case of the sub-keys is important, as these values are injected into
Kibana's configuration configmap with the toYaml function. More information on
the configuration options and available settings can be found in the official
Kibana documentation_.

.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html

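
As an illustrative sketch, a pair of these settings could be overridden as
follows; the values are arbitrary examples, not recommendations:

::

    conf:
      elasticsearch:
        requestTimeout: 60000   # wait longer for slow Elasticsearch queries
      logging:
        verbose: true           # emit verbose Kibana logs
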

Installation
------------

.. code-block:: bash

    helm install --namespace=<namespace> local/kibana --name=kibana

Setting Time Field
------------------

For Kibana to successfully read the logs from Elasticsearch's indexes, the time
field will need to be manually set after Kibana has successfully deployed. Upon
visiting the Kibana dashboard for the first time, a prompt will appear to choose
the time field with a drop-down menu. The default time field for Elasticsearch
indexes is '@timestamp'. Once this field is selected, the default view for
querying log entries can be found by selecting the "Discover" tab.


doc/source/monitoring/grafana.rst (new file, 89 lines)
@@ -0,0 +1,89 @@

Grafana
=======

The Grafana chart in OpenStack-Helm Infra provides default dashboards for the
metrics gathered with Prometheus. The default dashboards include visualizations
for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and
OpenStack.

Configuration
-------------

Grafana
~~~~~~~

Grafana's configuration is driven with the chart's values.yaml file, and the
relevant configuration entries are under the following key:

::

    conf:
      grafana:
        paths:
        server:
        database:
        session:
        security:
        users:
        log:
        log.console:
        dashboards.json:
        grafana_net:

These keys correspond to sections in the grafana.ini configuration file, and the
to_ini helm-toolkit function will render these values into the appropriate
format in grafana.ini. The list of options for these keys can be found in the
official Grafana configuration_ documentation.

.. _configuration: http://docs.grafana.org/installation/configuration/

Prometheus Data Source
~~~~~~~~~~~~~~~~~~~~~~

Grafana requires configured data sources for gathering metrics for display in
its dashboards. The configuration options for data sources are found under the
following key in Grafana's values.yaml file:

::

    conf:
      provisioning:
        datasources:
          monitoring:
            name: prometheus
            type: prometheus
            access: proxy
            orgId: 1
            editable: true
            basicAuth: true

The Grafana chart will use the keys under each entry beneath
.conf.provisioning.datasources as inputs to a helper template that will render
the appropriate configuration for the data source. The key for each data source
(monitoring in the above example) should map to an entry in the endpoints
section in the chart's values.yaml, as the data source's URL and authentication
credentials will be populated by the values defined in that endpoint.

.. _sources: http://docs.grafana.org/features/datasources/

Dashboards
~~~~~~~~~~

Grafana adds dashboards during installation with dashboards defined in YAML under
the following key:

::

    conf:
      dashboards:

These YAML definitions are transformed to JSON, added to Grafana's configuration
configmap, and mounted to the Grafana pods dynamically, allowing for flexibility
in defining and adding custom dashboards to Grafana. Dashboards can be added by
inserting a new key along with a YAML dashboard definition as the value.
Additional dashboards can be found by searching on Grafana's dashboards_ page or
you can define your own. A JSON-to-YAML tool, such as json2yaml_, will help
transform any custom or new dashboards from JSON to YAML.

.. _dashboards: https://grafana.com/dashboards
.. _json2yaml: https://www.json2yaml.com/

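
As an illustration only, a custom dashboard could be supplied with an override
of the following shape; the key name is arbitrary and the dashboard body below
is a stub rather than a working definition:

::

    conf:
      dashboards:
        my_custom_dashboard:
          title: My Custom Dashboard
          # the remainder of the dashboard definition, converted from
          # JSON to YAML, goes here
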

doc/source/monitoring/index.rst (new file, 11 lines)
@@ -0,0 +1,11 @@

OpenStack-Helm Monitoring
=========================

Contents:

.. toctree::
   :maxdepth: 2

   grafana
   prometheus
   nagios


doc/source/monitoring/nagios.rst (new file, 365 lines)
@@ -0,0 +1,365 @@

Nagios
======

The Nagios chart in openstack-helm-infra can be used to provide an alarming
service that's tightly coupled to an OpenStack-Helm deployment. The Nagios
chart uses a custom Nagios core image that includes plugins developed to query
Prometheus directly for scraped metrics and triggered alarms, query the Ceph
manager endpoints directly to determine the health of a Ceph cluster, and to
query Elasticsearch for logged events that meet certain criteria (experimental).

Authentication
--------------

The Nagios deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Nagios. The username and password
are configured under the nagios entry in the endpoints section of the chart's
values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Image Plugins
-------------

The Nagios image used contains custom plugins that can be used for the defined
service check commands. These plugins include:

- check_prometheus_metric.py: Query Prometheus for a specific metric and value
- check_exporter_health_metric.sh: Nagios plugin to query a Prometheus exporter
- check_rest_get_api.py: Check REST API status
- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
- query_prometheus_alerts.py: Nagios plugin to query the Prometheus ALERTS metric

More information about the Nagios image and plugins can be found here_.

.. _here: https://github.com/att-comdev/nagios

Nagios Service Configuration
----------------------------

The Nagios service is configured via the following section in the chart's
values file:

::

    conf:
      nagios:
        nagios:
          log_file: /opt/nagios/var/log/nagios.log
          cfg_file:
            - /opt/nagios/etc/nagios_objects.cfg
            - /opt/nagios/etc/objects/commands.cfg
            - /opt/nagios/etc/objects/contacts.cfg
            - /opt/nagios/etc/objects/timeperiods.cfg
            - /opt/nagios/etc/objects/templates.cfg
            - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
          object_cache_file: /opt/nagios/var/objects.cache
          precached_object_file: /opt/nagios/var/objects.precache
          resource_file: /opt/nagios/etc/resource.cfg
          status_file: /opt/nagios/var/status.dat
          status_update_interval: 10
          nagios_user: nagios
          nagios_group: nagios
          check_external_commands: 1
          command_file: /opt/nagios/var/rw/nagios.cmd
          lock_file: /var/run/nagios.lock
          temp_file: /opt/nagios/var/nagios.tmp
          temp_path: /tmp
          event_broker_options: -1
          log_rotation_method: d
          log_archive_path: /opt/nagios/var/log/archives
          use_syslog: 1
          log_service_retries: 1
          log_host_retries: 1
          log_event_handlers: 1
          log_initial_states: 0
          log_current_states: 1
          log_external_commands: 1
          log_passive_checks: 1
          service_inter_check_delay_method: s
          max_service_check_spread: 30
          service_interleave_factor: s
          host_inter_check_delay_method: s
          max_host_check_spread: 30
          max_concurrent_checks: 60
          check_result_reaper_frequency: 10
          max_check_result_reaper_time: 30
          check_result_path: /opt/nagios/var/spool/checkresults
          max_check_result_file_age: 3600
          cached_host_check_horizon: 15
          cached_service_check_horizon: 15
          enable_predictive_host_dependency_checks: 1
          enable_predictive_service_dependency_checks: 1
          soft_state_dependencies: 0
          auto_reschedule_checks: 0
          auto_rescheduling_interval: 30
          auto_rescheduling_window: 180
          service_check_timeout: 60
          host_check_timeout: 60
          event_handler_timeout: 60
          notification_timeout: 60
          ocsp_timeout: 5
          perfdata_timeout: 5
          retain_state_information: 1
          state_retention_file: /opt/nagios/var/retention.dat
          retention_update_interval: 60
          use_retained_program_state: 1
          use_retained_scheduling_info: 1
          retained_host_attribute_mask: 0
          retained_service_attribute_mask: 0
          retained_process_host_attribute_mask: 0
          retained_process_service_attribute_mask: 0
          retained_contact_host_attribute_mask: 0
          retained_contact_service_attribute_mask: 0
          interval_length: 1
          check_workers: 4
          check_for_updates: 1
          bare_update_check: 0
          use_aggressive_host_checking: 0
          execute_service_checks: 1
          accept_passive_service_checks: 1
          execute_host_checks: 1
          accept_passive_host_checks: 1
          enable_notifications: 1
          enable_event_handlers: 1
          process_performance_data: 0
          obsess_over_services: 0
          obsess_over_hosts: 0
          translate_passive_host_checks: 0
          passive_host_checks_are_soft: 0
          check_for_orphaned_services: 1
          check_for_orphaned_hosts: 1
          check_service_freshness: 1
          service_freshness_check_interval: 60
          check_host_freshness: 0
          host_freshness_check_interval: 60
          additional_freshness_latency: 15
          enable_flap_detection: 1
          low_service_flap_threshold: 5.0
          high_service_flap_threshold: 20.0
          low_host_flap_threshold: 5.0
          high_host_flap_threshold: 20.0
          date_format: us
          use_regexp_matching: 1
          use_true_regexp_matching: 0
          daemon_dumps_core: 0
          use_large_installation_tweaks: 0
          enable_environment_macros: 0
          debug_level: 0
          debug_verbosity: 1
          debug_file: /opt/nagios/var/nagios.debug
          max_debug_file_size: 1000000
          allow_empty_hostgroup_assignment: 1
          illegal_macro_output_chars: "`~$&|'<>\""

Nagios CGI Configuration
------------------------

The Nagios CGI configuration is defined via the following section in the chart's
values file:

::

    conf:
      nagios:
        cgi:
          main_config_file: /opt/nagios/etc/nagios.cfg
          physical_html_path: /opt/nagios/share
          url_html_path: /nagios
          show_context_help: 0
          use_pending_states: 1
          use_authentication: 0
          use_ssl_authentication: 0
          authorized_for_system_information: "*"
          authorized_for_configuration_information: "*"
          authorized_for_system_commands: nagiosadmin
          authorized_for_all_services: "*"
          authorized_for_all_hosts: "*"
          authorized_for_all_service_commands: "*"
          authorized_for_all_host_commands: "*"
          default_statuswrl_layout: 4
          ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
          refresh_rate: 90
          result_limit: 100
          escape_html_tags: 1
          action_url_target: _blank
          notes_url_target: _blank
          lock_author_names: 1
          navbar_search_for_addresses: 1
          navbar_search_for_aliases: 1

Nagios Host Configuration
-------------------------

The Nagios chart includes a single host definition for the Prometheus instance
queried for metrics. The host definition can be found under the following
values key:

::

    conf:
      nagios:
        hosts:
          - prometheus:
              use: linux-server
              host_name: prometheus
              alias: "Prometheus Monitoring"
              address: 127.0.0.1
              hostgroups: prometheus-hosts
              check_command: check-prometheus-host-alive

The address for the Prometheus host is defined by the PROMETHEUS_SERVICE
environment variable in the deployment template, which is determined by the
monitoring entry in the Nagios chart's endpoints section. The endpoint is then
available as a macro for Nagios to use in all Prometheus based queries. For
example:

::

    - check_prometheus_host_alive:
        command_name: check-prometheus-host-alive
        command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The $USER2$ macro above corresponds to the Prometheus endpoint defined in the
PROMETHEUS_SERVICE environment variable. All checks that use the
prometheus-hosts hostgroup will map back to the Prometheus host defined by this
endpoint.

Nagios HostGroup Configuration
------------------------------

The Nagios chart includes configuration values for defined host groups under the
following values key:

::

    conf:
      nagios:
        host_groups:
          - prometheus-hosts:
              hostgroup_name: prometheus-hosts
              alias: "Prometheus Virtual Host"
          - base-os:
              hostgroup_name: base-os
              alias: "base-os"

These hostgroups are used to define which group of hosts should be targeted by
a particular Nagios check. An example of a check that targets Prometheus for a
specific metric query would be:

::

    - check_ceph_monitor_quorum:
        use: notifying_service
        hostgroup_name: prometheus-hosts
        service_description: "CEPH_quorum"
        check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
        check_interval: 60

An example of a check that targets all hosts for a base-os type check (memory
usage, latency, etc) would be:

::

    - check_memory_usage:
        use: notifying_service
        service_description: Memory_usage
        check_command: check_memory_usage
        hostgroup_name: base-os

These two host groups allow for a wide range of targeted checks for determining
the status of all components of an OpenStack-Helm deployment.

Nagios Command Configuration
----------------------------

The Nagios chart includes configuration values for the command definitions Nagios
will use when executing service checks. These values are found under the
following key:

::

    conf:
      nagios:
        commands:
          - send_service_snmp_trap:
              command_name: send_service_snmp_trap
              command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
          - send_host_snmp_trap:
              command_name: send_host_snmp_trap
              command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
          - send_service_http_post:
              command_name: send_service_http_post
              command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
          - send_host_http_post:
              command_name: send_host_http_post
              command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
          - check_prometheus_host_alive:
              command_name: check-prometheus-host-alive
              command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The list of defined commands can be modified with configuration overrides, which
allows for the ability to define commands specific to an infrastructure
deployment. These commands can include querying Prometheus for metrics on
dependencies for a service to determine whether an alert should be raised,
executing checks on each host to determine network latency or file system usage,
or checking each node for issues with NTP clock skew.

Note: Since the conf.nagios.commands key contains a list of the defined commands,
the entire contents of conf.nagios.commands will need to be overridden if
additional commands are desired (due to the immutable nature of lists).

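
As a sketch, an override adding one extra command would therefore restate the
defaults and append the new definition. The check_local_disk entry below is
hypothetical and assumes the standard check_disk plugin is present in the image:

::

    conf:
      nagios:
        commands:
          # ... all of the default command definitions shown above ...
          - check_local_disk:
              command_name: check_local_disk
              command_line: "$USER1$/check_disk -w 20% -c 10% -p /"
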

Nagios Service Check Configuration
----------------------------------

The Nagios chart includes configuration values for the service checks Nagios
will execute. These service check commands can be found under the following
key:

::

    conf:
      nagios:
        services:
          - notifying_service:
              name: notifying_service
              use: generic-service
              flap_detection_enabled: 0
              process_perf_data: 0
              contact_groups: snmp_and_http_notifying_contact_group
              check_interval: 60
              notification_interval: 120
              retry_interval: 30
              register: 0
          - check_ceph_health:
              use: notifying_service
              hostgroup_name: base-os
              service_description: "CEPH_health"
              check_command: check_ceph_health
              check_interval: 300
          - check_hosts_health:
              use: generic-service
              hostgroup_name: prometheus-hosts
              service_description: "Nodes_health"
              check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready.
              check_interval: 60
          - check_prometheus_replicas:
              use: notifying_service
              hostgroup_name: prometheus-hosts
              service_description: "Prometheus_replica-count"
              check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas
              check_interval: 60

The Nagios service configurations define the checks Nagios will perform. These
checks contain keys for defining: the service type to use, the host group to
target, the description of the service check, the command the check should use,
and the interval at which to trigger the service check. These services can also
be extended to provide additional insight into the overall status of a
particular service, and they allow the ability to define advanced checks for
determining the overall health and liveness of a service. For example, a service
check could trigger an alarm for the OpenStack services when Nagios detects that
the relevant database and message queue have become unresponsive.

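
As with commands, conf.nagios.services is a list, so additional checks are added
by overriding the full list. The following entry is a hypothetical example that
assumes a rabbitmq_queue_messages_high alert rule exists in Prometheus:

::

    conf:
      nagios:
        services:
          # ... the default service definitions shown above ...
          - check_rabbitmq_queues:
              use: notifying_service
              hostgroup_name: prometheus-hosts
              service_description: "RabbitMQ_queue-depth"
              check_command: check_prom_alert!rabbitmq_queue_messages_high!CRITICAL- RabbitMQ queue depth is high!OK- RabbitMQ queue depth is normal
              check_interval: 60
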

doc/source/monitoring/prometheus.rst (new file, 338 lines)
@@ -0,0 +1,338 @@

Prometheus
==========

The Prometheus chart in openstack-helm-infra provides a time series database and
a strong querying language for monitoring various components of OpenStack-Helm.
Prometheus gathers metrics by scraping defined service endpoints or pods at
specified intervals and indexing them in the underlying time series database.

Authentication
--------------

The Prometheus deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Prometheus. The
username and password are configured under the monitoring entry in the endpoints
section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Prometheus Service Configuration
--------------------------------

The Prometheus service is configured via command line flags set during runtime.
These flags include: setting the configuration file, setting log levels, setting
characteristics of the time series database, and enabling the web admin API for
snapshot support. These settings can be configured via the values tree at:

::

    conf:
      prometheus:
        command_line_flags:
          log.level: info
          query.max_concurrency: 20
          query.timeout: 2m
          storage.tsdb.path: /var/lib/prometheus/data
          storage.tsdb.retention: 7d
          web.enable_admin_api: false
          web.enable_lifecycle: false

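
For example, extending the retention window and enabling the admin API for
snapshot support could be done with an override such as the following; the
values are illustrative:

::

    conf:
      prometheus:
        command_line_flags:
          storage.tsdb.retention: 30d
          web.enable_admin_api: true
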

The Prometheus configuration file contains the definitions for scrape targets
and the location of the rules files for triggering alerts on scraped metrics.
The configuration file is defined in the values file, and can be found at:

::

    conf:
      prometheus:
        scrape_configs: |

By defining the configuration via the values file, an operator can override all
configuration components of the Prometheus deployment at runtime.

Kubernetes Endpoint Configuration
---------------------------------

The Prometheus chart in openstack-helm-infra uses the built-in service discovery
mechanisms for Kubernetes endpoints and pods to automatically configure scrape
targets. Functions added to helm-toolkit allow configuration of these targets
via annotations that can be applied to any service or pod that exposes metrics
for Prometheus, whether a service for an application-specific exporter or an
application that provides a metrics endpoint via its service. The values in
these functions correspond to entries in the monitoring tree under the
prometheus key in a chart's values.yaml file.

The function definitions are below:

::

    {{- define "helm-toolkit.snippets.prometheus_service_annotations" -}}
    {{- $config := index . 0 -}}
    {{- if $config.scrape }}
    prometheus.io/scrape: {{ $config.scrape | quote }}
    {{- end }}
    {{- if $config.scheme }}
    prometheus.io/scheme: {{ $config.scheme | quote }}
    {{- end }}
    {{- if $config.path }}
    prometheus.io/path: {{ $config.path | quote }}
    {{- end }}
    {{- if $config.port }}
    prometheus.io/port: {{ $config.port | quote }}
    {{- end }}
    {{- end -}}

::

    {{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}}
    {{- $config := index . 0 -}}
    {{- if $config.scrape }}
    prometheus.io/scrape: {{ $config.scrape | quote }}
    {{- end }}
    {{- if $config.path }}
    prometheus.io/path: {{ $config.path | quote }}
    {{- end }}
    {{- if $config.port }}
    prometheus.io/port: {{ $config.port | quote }}
    {{- end }}
    {{- end -}}

These functions render the following annotations:

- prometheus.io/scrape: Must be set to true for Prometheus to scrape a target
- prometheus.io/scheme: Overrides the scheme used to scrape a target if not http
- prometheus.io/path: Overrides the path used to scrape a target's metrics if not /metrics
- prometheus.io/port: Overrides the port to scrape metrics on if not the service's default port

Each chart that can be targeted for monitoring by Prometheus has a prometheus
section under a monitoring tree in the chart's values.yaml, and Prometheus
monitoring is disabled by default for those services. Example values for the
required entries can be found in the following monitoring configuration for the
prometheus-node-exporter chart:

::

    monitoring:
      prometheus:
        enabled: false
        node_exporter:
          scrape: true

If the prometheus.enabled key is set to true, the annotations are set on the
targeted service or pod, as the condition for applying the annotations evaluates
to true. For example:

::

    {{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }}
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: {{ tuple "node_metrics" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
      labels:
    {{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
      annotations:
    {{- if .Values.monitoring.prometheus.enabled }}
    {{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }}
    {{- end }}

Kubelet, API Server, and cAdvisor
---------------------------------

The Prometheus chart includes scrape target configurations for the kubelet, the
Kubernetes API servers, and cAdvisor. These targets are configured based on
a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy
Kubernetes in the gates. These configurations may need to change based on your
chosen method of deployment. Please note the cAdvisor metrics will not be
captured if the kubelet was started with the following flag:

::

    --cadvisor-port=0

To enable the gathering of the kubelet's custom metrics, the following flag must
be set:

::

    --enable-custom-metrics

Installation
------------

The Prometheus chart can be installed with the following command:

.. code-block:: bash

    helm install --namespace=openstack local/prometheus --name=prometheus

The above command results in a Prometheus deployment configured to automatically
discover services with the necessary annotations for scraping, and configured to
gather metrics on the kubelet, the Kubernetes API servers, and cAdvisor.

Extending Prometheus
--------------------

Prometheus can target various exporters to gather metrics related to specific
applications to extend visibility into an OpenStack-Helm deployment. Currently,
openstack-helm-infra contains charts for:

- prometheus-kube-state-metrics: Provides additional Kubernetes metrics
- prometheus-node-exporter: Provides metrics for nodes and linux kernels
- prometheus-openstack-metrics-exporter: Provides metrics for OpenStack services

Kube-State-Metrics
~~~~~~~~~~~~~~~~~~

The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects
as well as metrics for kube-scheduler and kube-controller-manager. Information
on the specific metrics available via the kube-state-metrics service can be
found in the kube-state-metrics_ documentation.

The prometheus-kube-state-metrics chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics

.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation

Node Exporter
~~~~~~~~~~~~~

The prometheus-node-exporter chart provides hardware and operating system metrics
exposed via Linux kernels. Information on the specific metrics available via
the Node exporter can be found on the Node_exporter_ GitHub page.

The prometheus-node-exporter chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter

.. _Node_exporter: https://github.com/prometheus/node_exporter

OpenStack Exporter
~~~~~~~~~~~~~~~~~~

The prometheus-openstack-exporter chart provides metrics specific to the
OpenStack services. The exporter's source code can be found here_. While the
metrics provided are by no means comprehensive, they will be expanded upon.

Please note the OpenStack exporter requires the creation of a Keystone user to
successfully gather metrics. To create the required user, the chart uses the
same keystone user management job the OpenStack service charts use.

The prometheus-openstack-exporter chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter

.. _here: https://github.com/att-comdev/openstack-metrics-collector

Other exporters
~~~~~~~~~~~~~~~

Certain charts in OpenStack-Helm include templates for application-specific
Prometheus exporters, which keeps the monitoring of those services tightly coupled
to the chart. The templates for these exporters can be found in the monitoring
subdirectory in the chart. These exporters are disabled by default, and can be
enabled by setting the appropriate flag in the monitoring.prometheus key of the
chart's values.yaml file. The charts containing exporters include:

- Elasticsearch_
- RabbitMQ_
- MariaDB_
- Memcached_
- Fluentd_
- Postgres_

.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter
.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter
.. _MariaDB: https://github.com/prometheus/mysqld_exporter
.. _Memcached: https://github.com/prometheus/memcached_exporter
.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter
.. _Postgres: https://github.com/wrouesnel/postgres_exporter

Ceph
~~~~

Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus
module. This module can be enabled in Ceph's values.yaml under the
ceph_mgr_enabled_plugins key by appending prometheus to the list of enabled
modules. After enabling the prometheus module, metrics can be scraped on the
ceph-mgr service endpoint. This relies on the Prometheus annotations attached to
the ceph-mgr service template, and these annotations can be modified in the
endpoints section of Ceph's values.yaml file. Information on the specific
metrics available via the prometheus module can be found in the Ceph
prometheus_ module documentation.

.. _prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/

Prometheus Dashboard
--------------------

Prometheus includes a dashboard that can be accessed via the accessible
Prometheus endpoint (NodePort or otherwise). This dashboard will give you a
view of your scrape targets' state, the configuration values for Prometheus's
scrape jobs and command line flags, a view of any alerts triggered based on the
defined rules, and a means for using PromQL to query scraped metrics. The
Prometheus dashboard is a useful tool for verifying Prometheus is configured
appropriately and for verifying the status of any services targeted for scraping
via the Prometheus service discovery annotations.

Rules Configuration
-------------------

Prometheus provides a querying language that can operate on defined rules which
allow for the generation of alerts on specific metrics. The Prometheus chart in
openstack-helm-infra defines these rules via the values.yaml file. Defining
these in the values file gives operators the flexibility to provide specific
rules via overrides at installation. The following rules keys are provided:

::

    values:
      conf:
        rules:
          alertmanager:
          etcd3:
          kube_apiserver:
          kube_controller_manager:
          kubelet:
          kubernetes:
          rabbitmq:
          mysql:
          ceph:
          openstack:
          custom:

These keys provide recording and alert rules for all infrastructure
components of an OpenStack-Helm deployment. If you wish to exclude rules for a
component, leave the tree empty in an overrides file. To read more
about Prometheus recording and alert rules definitions, please see the official
Prometheus recording_ and alert_ rules documentation.

.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Note: Prometheus releases prior to 2.0 used gotpl to define rules. Prometheus
2.0 changed the rules format to YAML, making them much easier to read. The
Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take
advantage of changes to the underlying storage layer and the handling of stale
data. The chart will not support overrides for Prometheus versions below 2.0,
as the command line flags for the service changed between versions.

The wide range of exporters included in OpenStack-Helm coupled with the ability
to define rules with configuration overrides allows for the addition of custom
alerting and recording rules to fit an operator's monitoring needs. Adding new
rules or modifying existing rules requires overrides for either an existing key
under conf.rules or the addition of a new key under conf.rules. The addition
of custom rules can be used to define complex checks that can be extended for
determining the liveness or health of infrastructure components.

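
As an illustration, a custom alert rule in the Prometheus 2.0 YAML format could
be added under the custom key. The rule below is hypothetical (the metric names
and thresholds are examples only) and assumes each key under conf.rules is
rendered as its own rules file:

::

    conf:
      rules:
        custom:
          groups:
            - name: custom.rules
              rules:
                - alert: NodeFilesystemAlmostFull
                  expr: node_filesystem_avail / node_filesystem_size < 0.1
                  for: 10m
                  labels:
                    severity: warning
                  annotations:
                    description: "Instance {{ $labels.instance }} has less than 10% disk space available"
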

doc/source/readme.rst (new file, 1 line)
@@ -0,0 +1 @@

.. include:: ../../README.rst
