Merge "Specification: OSH logging, monitoring and alerting"
This commit is contained in:
commit
5fe6055d6b
@@ -6,6 +6,7 @@ Contents:
 .. toctree::
    :maxdepth: 2
 
+   osh-lma-stack.rst
    specifications.rst
    template.rst
    neutron-multiple-sdns.rst

219
doc/source/specs/osh-lma-stack.rst
Normal file
@@ -0,0 +1,219 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

..

=====================================
OSH Logging, Monitoring, and Alerting
=====================================

Blueprints:

1. osh-monitoring_
2. osh-logging-framework_

.. _osh-monitoring: https://blueprints.launchpad.net/openstack-helm/+spec/osh-monitoring
.. _osh-logging-framework: https://blueprints.launchpad.net/openstack-helm/+spec/openstack-logging-framework


Problem Description
===================

OpenStack-Helm currently lacks a centralized mechanism for providing insight
into the performance of the OpenStack services and infrastructure components.
The log formats of the different components in OpenStack-Helm vary, which makes
it difficult to identify the causes of issues across services. To support
operational readiness by default, OpenStack-Helm should include components for
logging events in a common format, monitoring metrics at all levels, raising
alerts and alarms on those metrics, and visualization tools for querying the
logs and metrics in a single-pane view.


Platform Requirements
=====================

Logging Requirements
--------------------

The requirements for a logging platform include:

1. All services in OpenStack-Helm log to stdout and stderr by default
2. A log collection daemon runs on each node to forward logs to storage
3. The proper directories are mounted to retrieve logs from the node
4. Ability to apply custom metadata and a uniform format to logs
5. Time-series database for collected logs
6. Backed by highly available storage
7. Configurable log rotation mechanism
8. Ability to perform custom queries against stored logs
9. Single-pane visualization capabilities

Monitoring Requirements
-----------------------

The requirements for a monitoring platform include:

1. Time-series database for collected metrics
2. Backed by highly available storage
3. Common method to configure all monitoring targets
4. Single-pane visualization capabilities
5. Ability to perform custom queries against collected metrics
6. Alerting capabilities to notify operators when thresholds are exceeded


Use Cases
=========

Logging Use Cases
-----------------

Example uses for centralized logging include:

1. Record compute instance behavior across nodes and services
2. Record OpenStack service behavior and status
3. Find all backtraces associated with a tenant's UUID
4. Identify issues with infrastructure components, such as RabbitMQ, MariaDB, etc.
5. Identify issues with Kubernetes components, such as etcd, CNI, scheduler, etc.
6. Meet organizational auditing needs
7. Visualize logged events to determine if an event is recurring or an outlier
8. Find all logged events that match a pattern (service, pod, behavior, etc.)
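
As an illustration of the tenant backtrace use case above, the following
minimal sketch builds an Elasticsearch query body in Python. It assumes,
hypothetically, that the raw log line is stored in a field named ``log``, that
backtraces contain the word ``Traceback``, and that events carry an
``@timestamp`` field. The function name and the tenant UUID are placeholders,
not part of this specification.

```python
import json


def tenant_backtrace_query(tenant_id, size=100):
    """Build an Elasticsearch query body for backtraces mentioning a tenant.

    Assumption: the raw log line lives in a field named "log" and
    backtraces contain the word "Traceback". Adjust both to the real mapping.
    """
    return {
        "size": size,
        # Newest events first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"log": "Traceback"}},
                    {"match_phrase": {"log": tenant_id}},
                ]
            }
        },
    }


if __name__ == "__main__":
    # Placeholder tenant UUID, for illustration only.
    body = tenant_backtrace_query("3f2c9c1e-0000-4000-8000-000000000000")
    print(json.dumps(body, indent=2))
```

The resulting body could be sent to the search endpoint of the ``logstash-*``
indexes with any HTTP client. Building it as a plain dictionary keeps the
sketch free of client library dependencies.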

Monitoring Use Cases
--------------------

Example OpenStack-Helm metrics requiring monitoring include:

1. Host utilization: memory usage, CPU usage, disk I/O, network I/O, etc.
2. Kubernetes metrics: pod status, replica availability, job status, etc.
3. Ceph metrics: total pool usage, latency, health, etc.
4. OpenStack metrics: tenants, networks, flavors, floating IPs, quotas, etc.
5. Proactive monitoring of stack traces across all deployed infrastructure

Examples of how these metrics can be used include:

1. Add or remove nodes depending on utilization
2. Trigger alerts when available replicas fall below the required number
3. Trigger alerts when services become unavailable or unresponsive
4. Identify etcd performance issues that could lead to cluster instability
5. Visualize performance to identify trends in traffic or utilization over time
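
The replica alerting case can be sketched as a query against the Prometheus
HTTP API. The example below is a rough illustration only. It assumes the
kube-state-metrics exporter is deployed and exposes its Deployment replica
metrics, and the Prometheus address shown is a placeholder.

```python
from urllib.parse import urlencode

# Expression that returns a series for every Deployment whose available
# replicas are below the desired count. The metric names are those exposed
# by kube-state-metrics; their presence is an assumption of this sketch.
REPLICA_SHORTFALL_EXPR = (
    "kube_deployment_status_replicas_available"
    " < kube_deployment_spec_replicas"
)


def instant_query_url(base_url, expr):
    """Return the URL for a Prometheus instant query of ``expr``."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


if __name__ == "__main__":
    # "http://prometheus:9090" is a placeholder address.
    print(instant_query_url("http://prometheus:9090", REPLICA_SHORTFALL_EXPR))
```

The same expression could serve as the condition of a Prometheus alerting
rule, so that the check runs continuously instead of on demand.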

Proposed Change
===============

Logging
-------

Fluentd, Elasticsearch, and Kibana meet OpenStack-Helm's logging requirements
for capture, storage, and visualization of logged events. Fluentd runs as a
daemonset on each node and mounts the /var/lib/docker/containers directory.
The Docker container runtime engine directs events posted to stdout and stderr
to this directory on the host. Fluentd should then declare the contents of
that directory as an input stream, and use the fluent-plugin-elasticsearch
plugin to apply the Logstash format to the logs. Fluentd will also use the
fluent-plugin-kubernetes_metadata_filter plugin to write Kubernetes metadata to
the log record. Fluentd will then forward the results to Elasticsearch, which
indexes the logs in a logstash-* index by default. The resulting logs can then
be queried directly through Elasticsearch, or they can be viewed via Kibana.
Kibana offers a dashboard that can create custom views on logged events, and
Kibana integrates well with Elasticsearch by default.
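
To make the data flow above concrete, the following Python sketch mimics what
happens to a single log line on its way to Elasticsearch. It is not Fluentd
configuration and not the plugins' actual implementation. It assumes the
Docker json-file log driver layout (``log``, ``stream``, and ``time`` keys)
and a daily ``logstash-YYYY.MM.DD`` index name. The ``kubernetes`` metadata
keys and the function name are illustrative.

```python
import json
from datetime import datetime, timezone


def to_logstash_document(docker_line, kubernetes_metadata):
    """Turn one Docker json-file log line into a Logstash-style document.

    ``kubernetes_metadata`` stands in for what the Kubernetes metadata
    plugin would attach; the keys used by callers here are illustrative.
    """
    event = json.loads(docker_line)
    # Docker timestamps carry nanoseconds; keep whole seconds only.
    seconds = event["time"].split(".")[0].rstrip("Z")
    stamp = datetime.strptime(seconds, "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    document = {
        "@timestamp": stamp.isoformat(),
        "log": event["log"].rstrip("\n"),
        "stream": event["stream"],
        "kubernetes": dict(kubernetes_metadata),
    }
    # Daily index name matching the logstash-* pattern.
    index = stamp.strftime("logstash-%Y.%m.%d")
    return index, document


if __name__ == "__main__":
    line = json.dumps(
        {
            "log": "ERROR boom\n",
            "stream": "stderr",
            "time": "2017-08-21T14:03:07.123456789Z",
        }
    )
    print(to_logstash_document(line, {"pod_name": "nova-api-0"}))
```

Attaching the metadata before indexing is what allows later queries to filter
by pod or namespace without parsing the log text itself.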

The proposal includes the following:

1. Helm chart for Fluentd
2. Helm chart for Elasticsearch
3. Helm chart for Kibana

All three charts must include sensible configuration values to make the
logging platform usable by default. These include proper input configurations
for Fluentd, proper metadata and formats applied to the logs via Fluentd,
sensible indexes created for Elasticsearch, and proper configuration values for
Kibana to query the Elasticsearch indexes previously created.

Monitoring
----------

Prometheus and Grafana meet OpenStack-Helm's monitoring requirements. The
Prometheus monitoring tool provides the ability to scrape targets for metrics
over HTTP, and it stores these metrics in Prometheus's time-series database.
The monitoring targets can be discovered via static configuration in Prometheus
or through service discovery. Prometheus includes a query language that
supports meaningful queries against the gathered metrics, as well as the
creation of rules against which these metrics are evaluated for alerting
purposes. A wide range of Prometheus exporters is also available for existing
services, including Ceph and OpenStack. Grafana supports Prometheus as a data
source, and provides the ability to view the metrics gathered by Prometheus in
a single-pane dashboard. Grafana can be bootstrapped with dashboards for each
target scraped, or the dashboards can be added via Grafana's web interface
directly. To meet OpenStack-Helm's alerting needs, Alertmanager can be used to
interface with Prometheus and send alerts based on Prometheus rule evaluations.
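
Because Prometheus scrapes targets over HTTP, an exporter only needs to serve
metrics as plain text. The sketch below renders one gauge in the Prometheus
text exposition format, using only the Python standard library. The metric
name ``osh_example_service_up`` and its label are hypothetical, and the sketch
does not escape special characters in label values, which a real exporter
would need to do.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    ``samples`` is a list of (labels, value) pairs, where ``labels`` is a
    dict. Label values are NOT escaped here; this is a simplification.
    """
    lines = [
        "# HELP {} {}".format(name, help_text),
        "# TYPE {} gauge".format(name),
    ]
    for labels, value in samples:
        label_text = ",".join(
            '{}="{}"'.format(key, labels[key]) for key in sorted(labels)
        )
        selector = "{" + label_text + "}" if label_text else ""
        lines.append("{}{} {}".format(name, selector, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and label, for illustration only.
    print(
        render_gauge(
            "osh_example_service_up",
            "Whether the service responded.",
            [({"service": "nova-api"}, 1)],
        )
    )
```

In practice an exporter would more likely rely on an existing Prometheus
client library. The sketch only shows what a scrape target returns.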

The proposal includes the following:

1. Helm chart for Prometheus
2. Helm chart for Alertmanager
3. Helm chart for Grafana
4. Helm charts for any appropriate Prometheus exporters

All charts must include sensible configuration values to make the monitoring
platform usable by default. These include static Prometheus configurations
for the included exporters, static dashboards for Grafana mounted via
ConfigMaps, and default configurations for Alertmanager.

Security Impact
---------------

All services running within the platform should be subject to the
security practices applied to the other OpenStack-Helm charts.

Performance Impact
------------------

To minimize the performance impacts, the following should be considered:

1. Sane defaults for log retention and rotation policies
2. Opportunities for improving Prometheus's operation over time
3. Elasticsearch configured to prevent memory swapping to disk
4. Elasticsearch configured in a highly available manner with sane defaults
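
As one way to picture a log retention default, the following sketch selects
the daily indexes that fall outside a retention window. It assumes indexes
follow the ``logstash-YYYY.MM.DD`` naming pattern. The function name and the
seven-day window are hypothetical, and the sketch does not delete anything.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, retention_days):
    """Return the daily logstash-YYYY.MM.DD indexes older than the window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        try:
            day = datetime.strptime(name, "logstash-%Y.%m.%d").date()
        except ValueError:
            # Not a daily logstash index; leave it alone.
            continue
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["logstash-2017.08.20", "logstash-2017.08.31", ".kibana"]
    # A seven-day window is an arbitrary example value.
    print(expired_indices(names, date(2017, 9, 1), 7))
```

Keeping the selection logic separate from the deletion step makes the policy
easy to test before it is allowed to remove any data.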


Implementation
==============

Assignee(s)
-----------

Primary assignees:
  srwilker (Steve Wilkerson)
  portdirect (Pete Birley)
  lr699s (Larry Rensing)


Work Items
----------

1. Fluentd chart
2. Elasticsearch chart
3. Kibana chart
4. Prometheus chart
5. Alertmanager chart
6. Grafana chart
7. Charts for exporters: kube-state-metrics, ceph-exporter, openstack-exporter?

All charts should follow design approaches applied to all other OpenStack-Helm
charts, including the use of helm-toolkit.

All charts require valid and sensible default values to provide operational
value out of the box.

Testing
=======

Testing should include Helm tests for each of the included charts as well as an
integration test in the gate.


Documentation Impact
====================

Documentation should be provided for each of the included charts, along with
documentation detailing the requirements for a usable monitoring platform,
preferably with sane default values out of the box.