Nagios ====== The Nagios chart in openstack-helm-infra can be used to provide an alarming service that's tightly coupled to an OpenStack-Helm deployment. The Nagios chart uses a custom Nagios core image that includes plugins developed to query Prometheus directly for scraped metrics and triggered alarms, query the Ceph manager endpoints directly to determine the health of a Ceph cluster, and to query Elasticsearch for logged events that meet certain criteria (experimental). Authentication -------------- The Nagios deployment includes a sidecar container that runs an Apache reverse proxy to add authentication capabilities for Nagios. The username and password are configured under the nagios entry in the endpoints section of the chart's values.yaml. The configuration for Apache can be found under the conf.httpd key, and uses a helm-toolkit function that allows for including gotpl entries in the template directly. This allows the use of other templates, like the endpoint lookup function templates, directly in the configuration for Apache. Image Plugins ------------- The Nagios image used contains custom plugins that can be used for the defined service check commands. These plugins include: - check_prometheus_metric.py: Query Prometheus for a specific metric and value - check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter - check_rest_get_api.py: Check REST API status - check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config - query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric More information about the Nagios image and plugins can be found here_. .. _here: https://github.com/att-comdev/nagios Nagios Service Configuration ---------------------------- The Nagios service is configured via the following section in the chart's values file: :: conf: nagios: nagios: log_file: /opt/nagios/var/log/nagios.log cfg_file: - /opt/nagios/etc/nagios_objects.cfg - /opt/nagios/etc/objects/commands.cfg - /opt/nagios/etc/objects/contacts.cfg - /opt/nagios/etc/objects/timeperiods.cfg - /opt/nagios/etc/objects/templates.cfg - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg object_cache_file: /opt/nagios/var/objects.cache precached_object_file: /opt/nagios/var/objects.precache resource_file: /opt/nagios/etc/resource.cfg status_file: /opt/nagios/var/status.dat status_update_interval: 10 nagios_user: nagios nagios_group: nagios check_external_commands: 1 command_file: /opt/nagios/var/rw/nagios.cmd lock_file: /var/run/nagios.lock temp_file: /opt/nagios/var/nagios.tmp temp_path: /tmp event_broker_options: -1 log_rotation_method: d log_archive_path: /opt/nagios/var/log/archives use_syslog: 1 log_service_retries: 1 log_host_retries: 1 log_event_handlers: 1 log_initial_states: 0 log_current_states: 1 log_external_commands: 1 log_passive_checks: 1 service_inter_check_delay_method: s max_service_check_spread: 30 service_interleave_factor: s host_inter_check_delay_method: s max_host_check_spread: 30 max_concurrent_checks: 60 check_result_reaper_frequency: 10 max_check_result_reaper_time: 30 check_result_path: /opt/nagios/var/spool/checkresults max_check_result_file_age: 3600 cached_host_check_horizon: 15 cached_service_check_horizon: 15 enable_predictive_host_dependency_checks: 1 enable_predictive_service_dependency_checks: 1 soft_state_dependencies: 0 auto_reschedule_checks: 0 auto_rescheduling_interval: 30 auto_rescheduling_window: 180 service_check_timeout: 60 host_check_timeout: 60 event_handler_timeout: 60 notification_timeout: 60 ocsp_timeout: 5 perfdata_timeout: 5 retain_state_information: 1 state_retention_file: /opt/nagios/var/retention.dat retention_update_interval: 60 use_retained_program_state: 1 use_retained_scheduling_info: 1 retained_host_attribute_mask: 0 retained_service_attribute_mask: 0 retained_process_host_attribute_mask: 0 retained_process_service_attribute_mask: 0 retained_contact_host_attribute_mask: 0 retained_contact_service_attribute_mask: 0 interval_length: 1 check_workers: 4 check_for_updates: 1 bare_update_check: 0 use_aggressive_host_checking: 0 execute_service_checks: 1 accept_passive_service_checks: 1 execute_host_checks: 1 accept_passive_host_checks: 1 enable_notifications: 1 enable_event_handlers: 1 process_performance_data: 0 obsess_over_services: 0 obsess_over_hosts: 0 translate_passive_host_checks: 0 passive_host_checks_are_soft: 0 check_for_orphaned_services: 1 check_for_orphaned_hosts: 1 check_service_freshness: 1 service_freshness_check_interval: 60 check_host_freshness: 0 host_freshness_check_interval: 60 additional_freshness_latency: 15 enable_flap_detection: 1 low_service_flap_threshold: 5.0 high_service_flap_threshold: 20.0 low_host_flap_threshold: 5.0 high_host_flap_threshold: 20.0 date_format: us use_regexp_matching: 1 use_true_regexp_matching: 0 daemon_dumps_core: 0 use_large_installation_tweaks: 0 enable_environment_macros: 0 debug_level: 0 debug_verbosity: 1 debug_file: /opt/nagios/var/nagios.debug max_debug_file_size: 1000000 allow_empty_hostgroup_assignment: 1 illegal_macro_output_chars: "`~$&|'<>\"" Nagios CGI Configuration ------------------------ The Nagios CGI configuration is defined via the following section in the chart's values file: :: conf: nagios: cgi: main_config_file: /opt/nagios/etc/nagios.cfg physical_html_path: /opt/nagios/share url_html_path: /nagios show_context_help: 0 use_pending_states: 1 use_authentication: 0 use_ssl_authentication: 0 authorized_for_system_information: "*" authorized_for_configuration_information: "*" authorized_for_system_commands: nagiosadmin authorized_for_all_services: "*" authorized_for_all_hosts: "*" authorized_for_all_service_commands: "*" authorized_for_all_host_commands: "*" default_statuswrl_layout: 4 ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$ refresh_rate: 90 result_limit: 100 escape_html_tags: 1 action_url_target: _blank notes_url_target: _blank lock_author_names: 1 navbar_search_for_addresses: 1 navbar_search_for_aliases: 1 Nagios Host Configuration ------------------------- The Nagios chart includes a single host definition for the Prometheus instance queried for metrics. The host definition can be found under the following values key: :: conf: nagios: hosts: - prometheus: use: linux-server host_name: prometheus alias: "Prometheus Monitoring" address: 127.0.0.1 hostgroups: prometheus-hosts check_command: check-prometheus-host-alive The address for the Prometheus host is defined by the PROMETHEUS_SERVICE environment variable in the deployment template, which is determined by the monitoring entry in the Nagios chart's endpoints section. The endpoint is then available as a macro for Nagios to use in all Prometheus based queries. For example: :: - check_prometheus_host_alive: command_name: check-prometheus-host-alive command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10" The $USER2$ macro above corresponds to the Prometheus endpoint defined in the PROMETHEUS_SERVICE environment variable. All checks that use the prometheus-hosts hostgroup will map back to the Prometheus host defined by this endpoint. Nagios HostGroup Configuration ------------------------------ The Nagios chart includes configuration values for defined host groups under the following values key: :: conf: nagios: host_groups: - prometheus-hosts: hostgroup_name: prometheus-hosts alias: "Prometheus Virtual Host" - base-os: hostgroup_name: base-os alias: "base-os" These hostgroups are used to define which group of hosts should be targeted by a particular nagios check. An example of a check that targets Prometheus for a specific metric query would be: :: - check_ceph_monitor_quorum: use: notifying_service hostgroup_name: prometheus-hosts service_description: "CEPH_quorum" check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists check_interval: 60 An example of a check that targets all hosts for a base-os type check (memory usage, latency, etc) would be: :: - check_memory_usage: use: notifying_service service_description: Memory_usage check_command: check_memory_usage hostgroup_name: base-os These two host groups allow for a wide range of targeted checks for determining the status of all components of an OpenStack-Helm deployment. Nagios Command Configuration ---------------------------- The Nagios chart includes configuration values for the command definitions Nagios will use when executing service checks. These values are found under the following key: :: conf: nagios: commands: - send_service_snmp_trap: command_name: send_service_snmp_trap command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'" - send_host_snmp_trap: command_name: send_host_snmp_trap command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'" - send_service_http_post: command_name: send_service_http_post command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'" - send_host_http_post: command_name: send_host_http_post command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'" - check_prometheus_host_alive: command_name: check-prometheus-host-alive command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10" The list of defined commands can be modified with configuration overrides, which allows for the ability define commands specific to an infrastructure deployment. These commands can include querying Prometheus for metrics on dependencies for a service to determine whether an alert should be raised, executing checks on each host to determine network latency or file system usage, or checking each node for issues with ntp clock skew. Note: Since the conf.nagios.commands key contains a list of the defined commands, the entire contents of conf.nagios.commands will need to be overridden if additional commands are desired (due to the immutable nature of lists). Nagios Service Check Configuration ---------------------------------- The Nagios chart includes configuration values for the service checks Nagios will execute. These service check commands can be found under the following key: :: conf: nagios: services: - notifying_service: name: notifying_service use: generic-service flap_detection_enabled: 0 process_perf_data: 0 contact_groups: snmp_and_http_notifying_contact_group check_interval: 60 notification_interval: 120 retry_interval: 30 register: 0 - check_ceph_health: use: notifying_service hostgroup_name: base-os service_description: "CEPH_health" check_command: check_ceph_health check_interval: 300 - check_hosts_health: use: generic-service hostgroup_name: prometheus-hosts service_description: "Nodes_health" check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready. check_interval: 60 - check_prometheus_replicas: use: notifying_service hostgroup_name: prometheus-hosts service_description: "Prometheus_replica-count" check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas check_interval: 60 The Nagios service configurations define the checks Nagios will perform. These checks contain keys for defining: the service type to use, the host group to target, the description of the service check, the command the check should use, and the interval at which to trigger the service check. These services can also be extended to provide additional insight into the overall status of a particular service. These services also allow the ability to define advanced checks for determining the overall health and liveness of a service. For example, a service check could trigger an alarm for the OpenStack services when Nagios detects that the relevant database and message queue has become unresponsive.