Merge "Specification: OSH logging, monitoring and alerting"
commit 5fe6055d6b
@@ -6,6 +6,7 @@ Contents:
.. toctree::
   :maxdepth: 2

   osh-lma-stack.rst
   specifications.rst
   template.rst
   neutron-multiple-sdns.rst
doc/source/specs/osh-lma-stack.rst (new file, 219 lines)
@@ -0,0 +1,219 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..

=====================================
OSH Logging, Monitoring, and Alerting
=====================================

Blueprints:

1. osh-monitoring_
2. osh-logging-framework_

.. _osh-monitoring: https://blueprints.launchpad.net/openstack-helm/+spec/osh-monitoring
.. _osh-logging-framework: https://blueprints.launchpad.net/openstack-helm/+spec/openstack-logging-framework

Problem Description
===================

OpenStack-Helm currently lacks a centralized mechanism for providing insight
into the performance of OpenStack services and infrastructure components. The
log formats of the components deployed by OpenStack-Helm vary, which makes it
difficult to trace the cause of an issue across services. To support
operational readiness by default, OpenStack-Helm should include components for
logging events in a common format, monitoring metrics at all levels, alerting
and alarms for those metrics, and visualization tools for querying the logs
and metrics in a single pane view.

Platform Requirements
=====================

Logging Requirements
--------------------

The requirements for a logging platform include:

1. All services in OpenStack-Helm log to stdout and stderr by default
2. A log collection daemon runs on each node to forward logs to storage
3. The proper directories are mounted to retrieve logs from the node (see the
   sketch below)
4. Ability to apply custom metadata and a uniform format to logs
5. Time-series database for the logs collected
6. Backed by highly available storage
7. Configurable log rotation mechanism
8. Ability to perform custom queries against stored logs
9. Single pane visualization capabilities
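
To illustrate requirements 2 and 3, the following is a minimal sketch of the
volume mounts a per-node log collection daemonset might use. The resource
names, image, and labels are placeholders, not a finalized chart layout:

.. code-block:: yaml

   # Hypothetical excerpt from a log collector daemonset template.
   # Mounting /var/log and /var/lib/docker/containers from the host
   # gives the collector access to container logs on each node.
   apiVersion: extensions/v1beta1
   kind: DaemonSet
   metadata:
     name: log-collector
   spec:
     template:
       metadata:
         labels:
           application: log-collector
       spec:
         containers:
           - name: log-collector
             image: log-collector:latest
             volumeMounts:
               - name: varlog
                 mountPath: /var/log
               - name: varlibdockercontainers
                 mountPath: /var/lib/docker/containers
                 readOnly: true
         volumes:
           - name: varlog
             hostPath:
               path: /var/log
           - name: varlibdockercontainers
             hostPath:
               path: /var/lib/docker/containers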

Monitoring Requirements
-----------------------

The requirements for a monitoring platform include:

1. Time-series database for collected metrics
2. Backed by highly available storage
3. Common method to configure all monitoring targets
4. Single pane visualization capabilities
5. Ability to perform custom queries against collected metrics
6. Alerting capabilities to notify operators when thresholds are exceeded

Use Cases
=========

Logging Use Cases
-----------------

Example uses for centralized logging include:

1. Record compute instance behavior across nodes and services
2. Record OpenStack service behavior and status
3. Find all backtraces for a tenant ID's UUID
4. Identify issues with infrastructure components, such as RabbitMQ, MariaDB, etc.
5. Identify issues with Kubernetes components, such as etcd, CNI, the scheduler, etc.
6. Meet organizational auditing needs
7. Visualize logged events to determine whether an event is recurring or an outlier
8. Find all logged events that match a pattern (service, pod, behavior, etc.)

Monitoring Use Cases
--------------------

Example OpenStack-Helm metrics requiring monitoring include:

1. Host utilization: memory usage, CPU usage, disk I/O, network I/O, etc.
2. Kubernetes metrics: pod status, replica availability, job status, etc.
3. Ceph metrics: total pool usage, latency, health, etc.
4. OpenStack metrics: tenants, networks, flavors, floating IPs, quotas, etc.
5. Proactive monitoring of stack traces across all deployed infrastructure

Examples of how these metrics can be used include:

1. Add or remove nodes depending on utilization
2. Trigger alerts when desired replicas fall below the required number (see
   the example rule below)
3. Trigger alerts when services become unavailable or unresponsive
4. Identify etcd performance degradation that could lead to cluster instability
5. Visualize performance to identify trends in traffic or utilization over time
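
As a sketch of use 2, the following shows how such an alert might be expressed
in the Prometheus 2.x YAML rule format against kube-state-metrics data. The
alert name, duration, and severity label are illustrative assumptions:

.. code-block:: yaml

   # Hypothetical alert: fire when a deployment has had fewer available
   # replicas than desired for ten minutes. The metric names come from
   # kube-state-metrics.
   groups:
     - name: kubernetes.rules
       rules:
         - alert: DeploymentReplicasUnavailable
           expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
           for: 10m
           labels:
             severity: warning
           annotations:
             description: >
               Deployment {{ $labels.deployment }} has had fewer available
               replicas than desired for more than ten minutes.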

Proposed Change
===============

Logging
-------

Fluentd, Elasticsearch, and Kibana meet OpenStack-Helm's logging requirements
for the capture, storage, and visualization of logged events. Fluentd runs as
a daemonset on each node and mounts the /var/lib/docker/containers directory.
The Docker container runtime engine directs events posted to stdout and stderr
to this directory on the host. Fluentd should then declare the contents of
that directory as an input stream, and use the fluent-plugin-elasticsearch
plugin to apply the Logstash format to the logs. Fluentd will also use the
fluent-plugin-kubernetes_metadata_filter plugin to write Kubernetes metadata
to the log record. Fluentd will then forward the results to Elasticsearch,
which indexes the logs in a logstash-* index by default. The resulting logs
can then be queried directly through Elasticsearch, or they can be viewed via
Kibana. Kibana offers a dashboard that can create custom views on logged
events, and Kibana integrates well with Elasticsearch by default.
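
A minimal sketch of this pipeline follows, wrapped in a configMap as a Helm
chart might template it. The Elasticsearch host, tag names, and file paths are
assumptions for illustration; /var/log/containers holds symlinks into
/var/lib/docker/containers, so both directories would be mounted into the
Fluentd pod:

.. code-block:: yaml

   # Hypothetical configMap carrying a minimal fluent.conf. The plugin
   # types (tail, kubernetes_metadata, elasticsearch) are real Fluentd
   # plugins; the host and tag values are illustrative.
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: fluentd-etc
   data:
     fluent.conf: |
       <source>
         @type tail
         path /var/log/containers/*.log
         pos_file /var/log/fluentd-containers.log.pos
         tag kubernetes.*
         format json
       </source>
       <filter kubernetes.**>
         # Enrich each record with pod, namespace, and label metadata.
         @type kubernetes_metadata
       </filter>
       <match kubernetes.**>
         # logstash_format writes into daily logstash-YYYY.MM.DD indexes.
         @type elasticsearch
         host elasticsearch
         port 9200
         logstash_format true
       </match>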

The proposal includes the following:

1. Helm chart for Fluentd
2. Helm chart for Elasticsearch
3. Helm chart for Kibana

All three charts must include sensible configuration values to make the
logging platform usable by default. These include: proper input configurations
for Fluentd, proper metadata and formats applied to the logs via Fluentd,
sensible indexes created for Elasticsearch, and proper configuration values
for Kibana to query the Elasticsearch indexes previously created.
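
As one example of such defaults, the Kibana chart might ship a kibana.yml that
points at the Elasticsearch service out of the box. A minimal sketch, with an
assumed service name and port:

.. code-block:: yaml

   # Hypothetical kibana.yml defaults; the Elasticsearch service name
   # and port are assumptions.
   server.host: 0.0.0.0
   elasticsearch.url: http://elasticsearch:9200
   kibana.defaultAppId: discover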

Monitoring
----------

Prometheus and Grafana meet OpenStack-Helm's monitoring requirements. The
Prometheus monitoring tool provides the ability to scrape targets for metrics
over HTTP, and it stores these metrics in its time-series database. The
monitoring targets can be discovered via static configuration in Prometheus or
through service discovery. Prometheus includes a query language for meaningful
queries against the gathered metrics and supports the creation of rules that
evaluate those metrics for alerting purposes. A wide range of Prometheus
exporters exist for common services, including Ceph and OpenStack. Grafana
supports Prometheus as a data source and provides the ability to view the
metrics gathered by Prometheus in a single pane dashboard. Grafana can be
bootstrapped with dashboards for each target scraped, or the dashboards can be
added directly via Grafana's web interface. To meet OpenStack-Helm's alerting
needs, Alertmanager can be used to interface with Prometheus and send alerts
based on Prometheus rule evaluations.
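
A minimal sketch of how Prometheus might discover its targets through
Kubernetes service discovery and hand alerts to Alertmanager follows. The
annotation convention and the Alertmanager service name are assumptions, not
settled chart defaults:

.. code-block:: yaml

   # Hypothetical prometheus.yml excerpt: keep any endpoint whose
   # service carries a prometheus.io/scrape: "true" annotation, and
   # send fired alerts to an Alertmanager service.
   scrape_configs:
     - job_name: kubernetes-endpoints
       kubernetes_sd_configs:
         - role: endpoints
       relabel_configs:
         - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
           action: keep
           regex: "true"
   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager:9093']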

The proposal includes the following:

1. Helm chart for Prometheus
2. Helm chart for Alertmanager
3. Helm chart for Grafana
4. Helm charts for any appropriate Prometheus exporters

All charts must include sensible configuration values to make the monitoring
platform usable by default. These include: static Prometheus configurations
for the included exporters, static dashboards for Grafana mounted via
configMaps, and configurations for Alertmanager out of the box.
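
For Alertmanager, a minimal out-of-the-box route might look like the sketch
below; the receiver name, timings, and webhook URL are placeholders:

.. code-block:: yaml

   # Hypothetical alertmanager.yml: group alerts by name and deliver
   # them to a single placeholder webhook receiver.
   route:
     group_by: ['alertname']
     group_wait: 30s
     repeat_interval: 4h
     receiver: default
   receivers:
     - name: default
       webhook_configs:
         - url: http://alert-receiver.example/alerts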

Security Impact
---------------

All services running within the platform should be subject to the security
practices applied to the other OpenStack-Helm charts.

Performance Impact
------------------

To minimize the performance impacts, the following should be considered:

1. Sane defaults for log retention and rotation policies
2. Identify opportunities for improving Prometheus's operation over time
3. Elasticsearch configured to prevent memory swapping to disk (see the
   sketch below)
4. Elasticsearch configured in a highly available manner with sane defaults
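
For item 3, Elasticsearch's bootstrap.memory_lock setting keeps the JVM heap
from being swapped out; the container must also have its memlock ulimit
raised. A minimal sketch of the relevant elasticsearch.yml lines, with an
assumed cluster name and master count:

.. code-block:: yaml

   # Hypothetical elasticsearch.yml excerpt. memory_lock pins the heap
   # in RAM; minimum_master_nodes guards a multi-master cluster against
   # split brain.
   cluster.name: elasticsearch
   bootstrap.memory_lock: true
   discovery.zen.minimum_master_nodes: 2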

Implementation
==============

Assignee(s)
-----------

Primary assignees:
  srwilker (Steve Wilkerson)
  portdirect (Pete Birley)
  lr699s (Larry Rensing)

Work Items
----------

1. Fluentd chart
2. Elasticsearch chart
3. Kibana chart
4. Prometheus chart
5. Alertmanager chart
6. Grafana chart
7. Charts for exporters: kube-state-metrics, ceph-exporter, and possibly
   openstack-exporter

All charts should follow the design approaches applied to the other
OpenStack-Helm charts, including the use of helm-toolkit.

All charts require valid and sensible default values to provide operational
value out of the box.

Testing
=======

Testing should include Helm tests for each of the included charts, as well as
an integration test in the gate.
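
A Helm test for, say, the Elasticsearch chart could be a pod carrying the
helm.sh/hook test annotation that probes cluster health and exits nonzero on
failure. The following sketch assumes an elasticsearch service listening on
port 9200 and a generic curl image:

.. code-block:: yaml

   # Hypothetical Helm test pod: the test-success hook tells `helm test`
   # to run this pod and treat a zero exit code as a pass.
   apiVersion: v1
   kind: Pod
   metadata:
     name: elasticsearch-test
     annotations:
       "helm.sh/hook": test-success
   spec:
     restartPolicy: Never
     containers:
       - name: elasticsearch-health
         image: appropriate/curl
         command:
           - curl
           - --fail
           - http://elasticsearch:9200/_cluster/health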

Documentation Impact
====================

Documentation should be included for each of the included charts, as well as
documentation detailing the requirements for a usable monitoring platform,
preferably with sane default values out of the box.