Steve Wilkerson eab9ca05a6 Foundation for LMA docs
This begins building documentation for the LMA services included
in openstack-helm-infra. This includes documentation for: kibana,
elasticsearch, fluent-logging, grafana, prometheus, and nagios

Change-Id: Iaa24be04748e76fabca998972398802e7e921ef1
Signed-off-by: Steve Wilkerson <wilkers.steve@gmail.com>
2019-01-07 21:02:54 +00:00

14 KiB

Nagios

The Nagios chart in openstack-helm-infra can be used to provide an alarming service that's tightly coupled to an OpenStack-Helm deployment. The Nagios chart uses a custom Nagios core image that includes plugins developed to query Prometheus directly for scraped metrics and triggered alarms, query the Ceph manager endpoints directly to determine the health of a Ceph cluster, and to query Elasticsearch for logged events that meet certain criteria (experimental).

Authentication

The Nagios deployment includes a sidecar container that runs an Apache reverse proxy to add authentication capabilities for Nagios. The username and password are configured under the nagios entry in the endpoints section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a helm-toolkit function that allows for including gotpl entries in the template directly. This allows the use of other templates, like the endpoint lookup function templates, directly in the configuration for Apache.

Image Plugins

The Nagios image used contains custom plugins that can be used for the defined service check commands. These plugins include:

  • check_prometheus_metric.py: Query Prometheus for a specific metric and value
  • check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter
  • check_rest_get_api.py: Check REST API status
  • check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
  • query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric

More information about the Nagios image and plugins can be found here.

Nagios Service Configuration

The Nagios service is configured via the following section in the chart's values file:

conf:
  nagios:
    nagios:
      log_file: /opt/nagios/var/log/nagios.log
      cfg_file:
        - /opt/nagios/etc/nagios_objects.cfg
        - /opt/nagios/etc/objects/commands.cfg
        - /opt/nagios/etc/objects/contacts.cfg
        - /opt/nagios/etc/objects/timeperiods.cfg
        - /opt/nagios/etc/objects/templates.cfg
        - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
      object_cache_file: /opt/nagios/var/objects.cache
      precached_object_file: /opt/nagios/var/objects.precache
      resource_file: /opt/nagios/etc/resource.cfg
      status_file: /opt/nagios/var/status.dat
      status_update_interval: 10
      nagios_user: nagios
      nagios_group: nagios
      check_external_commands: 1
      command_file: /opt/nagios/var/rw/nagios.cmd
      lock_file: /var/run/nagios.lock
      temp_file: /opt/nagios/var/nagios.tmp
      temp_path: /tmp
      event_broker_options: -1
      log_rotation_method: d
      log_archive_path: /opt/nagios/var/log/archives
      use_syslog: 1
      log_service_retries: 1
      log_host_retries: 1
      log_event_handlers: 1
      log_initial_states: 0
      log_current_states: 1
      log_external_commands: 1
      log_passive_checks: 1
      service_inter_check_delay_method: s
      max_service_check_spread: 30
      service_interleave_factor: s
      host_inter_check_delay_method: s
      max_host_check_spread: 30
      max_concurrent_checks: 60
      check_result_reaper_frequency: 10
      max_check_result_reaper_time: 30
      check_result_path: /opt/nagios/var/spool/checkresults
      max_check_result_file_age: 3600
      cached_host_check_horizon: 15
      cached_service_check_horizon: 15
      enable_predictive_host_dependency_checks: 1
      enable_predictive_service_dependency_checks: 1
      soft_state_dependencies: 0
      auto_reschedule_checks: 0
      auto_rescheduling_interval: 30
      auto_rescheduling_window: 180
      service_check_timeout: 60
      host_check_timeout: 60
      event_handler_timeout: 60
      notification_timeout: 60
      ocsp_timeout: 5
      perfdata_timeout: 5
      retain_state_information: 1
      state_retention_file: /opt/nagios/var/retention.dat
      retention_update_interval: 60
      use_retained_program_state: 1
      use_retained_scheduling_info: 1
      retained_host_attribute_mask: 0
      retained_service_attribute_mask: 0
      retained_process_host_attribute_mask: 0
      retained_process_service_attribute_mask: 0
      retained_contact_host_attribute_mask: 0
      retained_contact_service_attribute_mask: 0
      interval_length: 1
      check_workers: 4
      check_for_updates: 1
      bare_update_check: 0
      use_aggressive_host_checking: 0
      execute_service_checks: 1
      accept_passive_service_checks: 1
      execute_host_checks: 1
      accept_passive_host_checks: 1
      enable_notifications: 1
      enable_event_handlers: 1
      process_performance_data: 0
      obsess_over_services: 0
      obsess_over_hosts: 0
      translate_passive_host_checks: 0
      passive_host_checks_are_soft: 0
      check_for_orphaned_services: 1
      check_for_orphaned_hosts: 1
      check_service_freshness: 1
      service_freshness_check_interval: 60
      check_host_freshness: 0
      host_freshness_check_interval: 60
      additional_freshness_latency: 15
      enable_flap_detection: 1
      low_service_flap_threshold: 5.0
      high_service_flap_threshold: 20.0
      low_host_flap_threshold: 5.0
      high_host_flap_threshold: 20.0
      date_format: us
      use_regexp_matching: 1
      use_true_regexp_matching: 0
      daemon_dumps_core: 0
      use_large_installation_tweaks: 0
      enable_environment_macros: 0
      debug_level: 0
      debug_verbosity: 1
      debug_file: /opt/nagios/var/nagios.debug
      max_debug_file_size: 1000000
      allow_empty_hostgroup_assignment: 1
      illegal_macro_output_chars: "`~$&|'<>\""

Nagios CGI Configuration

The Nagios CGI configuration is defined via the following section in the chart's values file:

conf:
  nagios:
    cgi:
      main_config_file: /opt/nagios/etc/nagios.cfg
      physical_html_path: /opt/nagios/share
      url_html_path: /nagios
      show_context_help: 0
      use_pending_states: 1
      use_authentication: 0
      use_ssl_authentication: 0
      authorized_for_system_information: "*"
      authorized_for_configuration_information: "*"
      authorized_for_system_commands: nagiosadmin
      authorized_for_all_services: "*"
      authorized_for_all_hosts: "*"
      authorized_for_all_service_commands: "*"
      authorized_for_all_host_commands: "*"
      default_statuswrl_layout: 4
      ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
      refresh_rate: 90
      result_limit: 100
      escape_html_tags: 1
      action_url_target: _blank
      notes_url_target: _blank
      lock_author_names: 1
      navbar_search_for_addresses: 1
      navbar_search_for_aliases: 1

Nagios Host Configuration

The Nagios chart includes a single host definition for the Prometheus instance queried for metrics. The host definition can be found under the following values key:

conf:
  nagios:
    hosts:
      - prometheus:
          use: linux-server
          host_name: prometheus
          alias: "Prometheus Monitoring"
          address: 127.0.0.1
          hostgroups: prometheus-hosts
          check_command: check-prometheus-host-alive

The address for the Prometheus host is defined by the PROMETHEUS_SERVICE environment variable in the deployment template, which is determined by the monitoring entry in the Nagios chart's endpoints section. The endpoint is then available as a macro for Nagios to use in all Prometheus based queries. For example:

- check_prometheus_host_alive:
    command_name: check-prometheus-host-alive
    command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The $USER2$ macro above corresponds to the Prometheus endpoint defined in the PROMETHEUS_SERVICE environment variable. All checks that use the prometheus-hosts hostgroup will map back to the Prometheus host defined by this endpoint.

Nagios HostGroup Configuration

The Nagios chart includes configuration values for defined host groups under the following values key:

conf:
  nagios:
    host_groups:
      - prometheus-hosts:
          hostgroup_name: prometheus-hosts
          alias: "Prometheus Virtual Host"
      - base-os:
          hostgroup_name: base-os
          alias: "base-os"

These hostgroups are used to define which group of hosts should be targeted by a particular nagios check. An example of a check that targets Prometheus for a specific metric query would be:

- check_ceph_monitor_quorum:
    use: notifying_service
    hostgroup_name: prometheus-hosts
    service_description: "CEPH_quorum"
    check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
    check_interval: 60

An example of a check that targets all hosts for a base-os type check (memory usage, latency, etc) would be:

- check_memory_usage:
    use: notifying_service
    service_description: Memory_usage
    check_command: check_memory_usage
    hostgroup_name: base-os

These two host groups allow for a wide range of targeted checks for determining the status of all components of an OpenStack-Helm deployment.

Nagios Command Configuration

The Nagios chart includes configuration values for the command definitions Nagios will use when executing service checks. These values are found under the following key:

conf:
  nagios:
    commands:
      - send_service_snmp_trap:
          command_name: send_service_snmp_trap
          command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
      - send_host_snmp_trap:
          command_name: send_host_snmp_trap
          command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
      - send_service_http_post:
          command_name: send_service_http_post
          command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
      - send_host_http_post:
          command_name: send_host_http_post
          command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
      - check_prometheus_host_alive:
          command_name: check-prometheus-host-alive
          command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The list of defined commands can be modified with configuration overrides, which allows for the ability define commands specific to an infrastructure deployment. These commands can include querying Prometheus for metrics on dependencies for a service to determine whether an alert should be raised, executing checks on each host to determine network latency or file system usage, or checking each node for issues with ntp clock skew.

Note: Since the conf.nagios.commands key contains a list of the defined commands, the entire contents of conf.nagios.commands will need to be overridden if additional commands are desired (due to the immutable nature of lists).

Nagios Service Check Configuration

The Nagios chart includes configuration values for the service checks Nagios will execute. These service check commands can be found under the following key:

::
conf:
nagios:
services:
  • notifying_service:

    name: notifying_service use: generic-service flap_detection_enabled: 0 process_perf_data: 0 contact_groups: snmp_and_http_notifying_contact_group check_interval: 60 notification_interval: 120 retry_interval: 30 register: 0

  • check_ceph_health:

    use: notifying_service hostgroup_name: base-os service_description: "CEPH_health" check_command: check_ceph_health check_interval: 300

  • check_hosts_health:

    use: generic-service hostgroup_name: prometheus-hosts service_description: "Nodes_health" check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready. check_interval: 60

  • check_prometheus_replicas:

    use: notifying_service hostgroup_name: prometheus-hosts service_description: "Prometheus_replica-count" check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas check_interval: 60

The Nagios service configurations define the checks Nagios will perform. These checks contain keys for defining: the service type to use, the host group to target, the description of the service check, the command the check should use, and the interval at which to trigger the service check. These services can also be extended to provide additional insight into the overall status of a particular service. These services also allow the ability to define advanced checks for determining the overall health and liveness of a service. For example, a service check could trigger an alarm for the OpenStack services when Nagios detects that the relevant database and message queue has become unresponsive.