Spec to deploy Prometheus as a Cacti replacement

Our Cacti server is aging and needs to be replaced. Rather than go through
a difficult Cacti upgrade, take the opportunity to replace that system with
a modern one that gives us more potential functionality.

Change-Id: Iee197bc0e8e02007d1fb45464bbadb4c283e96e8
parent ae010afc6f
commit cfc6791522
@@ -66,6 +66,7 @@ section of this index.
   specs/central-auth
   specs/irc
 + specs/prometheus
   specs/storyboard_integration_tests
   specs/storyboard_story_tags
   specs/storyboard_subscription_pub_sub

specs/prometheus.rst (new file, +173 lines)
@@ -0,0 +1,173 @@

::

  Copyright 2021 Open Infrastructure Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode

========================
Run a Prometheus Service
========================

https://storyboard.openstack.org/#!/story/2009228

Our existing systems metrics tooling is built around Cacti. Unfortunately,
this tooling is aging without a great path forward into the future. This
gives us the opportunity to reevaluate and consider what tools might be
best leveraged for gathering systems metrics today. Prometheus has grown
to become a popular tool in this space, is well supported, and allows us
to gather application metrics for many of the services we already run in
addition to systems metrics. Let's run a Prometheus instance and start
replacing Cacti.

Problem Description
===================

In order to properly size the services we run, debug issues with resource
limits, and generally ensure the health of our systems we need to collect
metrics on how they are performing. Historically we have done this with
Cacti, which polls systems via SNMP and collects that information in RRD
files. Cacti then renders graphs of this RRD data per host over various
time ranges.

Our Cacti installation is aging and needs to be upgraded. Rather than put
a bunch of effort into maintaining this older system and modernizing it,
we can jump directly to Prometheus, which software like Zuul, Gerrit, and
Gitea already supports. This change is likely to require a bit more
bootstrapping effort, but in the end we will get a much richer set of
metrics for understanding our systems and software.

Proposed Change
===============

We will deploy a new server with a large attached volume. We will then run
Prometheus with docker-compose, using the prom/prometheus image published
to Docker Hub. The large volume will be mounted to provide storage for
Prometheus' TSDB files.
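
As a sketch of what that could look like (the file paths, image tag, and
service layout here are assumptions, not settled choices):

.. code-block:: yaml

   # docker-compose.yaml sketch; paths and tag are illustrative assumptions
   version: '2'
   services:
     prometheus:
       image: docker.io/prom/prometheus:latest
       network_mode: host
       restart: always
       volumes:
         # the large attached volume, mounted on the host for TSDB storage
         - /var/lib/prometheus:/prometheus
         - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml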

To collect the system metrics we will use Prometheus' node-exporter tool.
The upstream for this tool publishes binaries for x86_64 and arm64 systems.
We will use the published binaries (possibly via a local copy) instead of
distro packages because the distro packages are quite old, and node-exporter
changed its metric schemas multiple times before it hit version 1.0. We use
the published binaries instead of docker images because running node-exporter
in docker is awkward: you have to expose significant system resources into
the container for it to properly collect their details. We will need to open
up node-exporter's publishing port to the new Prometheus server in our
firewall rules.
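
For illustration, running the published binary could be handled with a small
systemd unit along these lines (the install path and unit name are
assumptions); node-exporter listens on port 9100 by default, which is the
port we would need to open:

.. code-block:: ini

   # /etc/systemd/system/node-exporter.service -- sketch; paths assumed
   [Unit]
   Description=Prometheus node exporter
   After=network.target

   [Service]
   # binary fetched from the upstream release (or our local copy)
   ExecStart=/usr/local/bin/node_exporter
   Restart=always

   [Install]
   WantedBy=multi-user.target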

Once the base set of services and firewall access are in place we can begin
to roll out configuration that polls the instances and renders the
information into sets of graphs per instance. Ideally this will be configured
automatically for instances in our inventory, similar to how sslcertcheck
works. At this point none of us are Prometheus experts, so we will not
prescribe what those configs should look like here. Instead we expect
Prometheus config to ingest metrics per instance, and grafana configs to
render graphs per instance for that data.
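
As a rough sketch of the Prometheus side (the job name, interval, and target
hostnames are illustrative; in practice the target list would be generated
from our inventory):

.. code-block:: yaml

   # prometheus.yml scrape config sketch; hostnames are placeholders
   global:
     scrape_interval: 60s
   scrape_configs:
     - job_name: 'node'
       static_configs:
         - targets:
             - 'example01.opendev.org:9100'
             - 'example02.opendev.org:9100'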

We can leverage our functional testing system to work out what these configs
should look like, or simply modify them on the new server until we are happy.
We can get away with making these updates "in production" because the new
service won't be in production until we are happy with it.

Once we are happy with the results we should collect data side by side in
both Cacti and Prometheus for one month. We can then compare the two systems
to ensure the data is accurate and usable. Once we have made this
determination the old Cacti server can be put into hibernation for historical
record purposes.

Integrating with services like Zuul, Gerrit, Gerritbot, and Gitea is also
possible but outside the scope of this spec. Adding these integrations is
expected to be straightforward once the Cacti replacement details have been
sorted out.

Alternatives
------------

We can keep running Cacti and upgrade it one way or another. The end result
will be familiar but provide far less functionality.

We can run Prometheus with its SNMP exporter instead of node exporter. The
upside to this approach is that we already know how to collect SNMP data from
our servers. The downside is that the Prometheus community seems to prefer
node exporter, and there is a bit more tooling around it; we will probably
find better support for grafana dashboards and graphs this way. Additionally,
node exporter collects a lot of information for free that we would otherwise
have to write our own SNMP MIBs for. This is a good opportunity to use modern
tooling.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  TBD

Gerrit Topic
------------

Use Gerrit topic "opendev-prometheus" for all patches related to this spec.

.. code-block:: bash

    git-review -t opendev-prometheus

Work Items
----------

1. Deploy a new metrics.opendev.org server and update DNS.
2. Deploy prometheus on the new server with docker-compose.
3. Deploy node exporter on all of our instances.
4. Update firewall rules on all of our instances to allow Prometheus polls
   from the new server.
5. Configure Prometheus to poll our instances.
6. Review the results and iterate until we are collecting what we want to
   collect and it is safe to expose publicly.
7. Open firewall rules on the new server to expose the Prometheus data
   externally.
8. Build grafana dashboards for our instances exposing the metrics in
   Prometheus.

Repositories
------------

No new repositories will need to be created. All config should live in
opendev/system-config.

Servers
-------

We will create a new metrics.opendev.org server.

DNS Entries
-----------

Only DNS records for the new server will be created.

Documentation
-------------

We will update documentation to include information on operating prometheus,
adding sources of data to prometheus, and adding graph dashboards to grafana
backed by prometheus.

Security
--------

We will need to update firewall rules on all systems to allow Prometheus
polls from the new metrics.opendev.org server.
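
As a sketch, the per-instance rule might look like the following (the source
address is a documentation placeholder for the real server IP; 9100 is
node-exporter's default port):

.. code-block:: bash

   # allow the metrics server to scrape node-exporter; address is a placeholder
   iptables -A INPUT -p tcp -s 203.0.113.10 --dport 9100 -j ACCEPT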

Testing
-------

A system-config-run-prometheus job will be added to run prometheus and at
least one other server that it will gather metrics from. This will ensure
that node exporter polling and ingestion to prometheus is functional.
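
A sketch of what the job definition might look like (the parent, nodeset,
and labels follow the pattern of our existing system-config-run-* jobs and
are assumptions here):

.. code-block:: yaml

   # Zuul job sketch; names and labels are assumptions
   - job:
       name: system-config-run-prometheus
       parent: system-config-run
       description: Deploy a prometheus server and poll a test node with it.
       nodeset:
         nodes:
           - name: prometheus01.opendev.org
             label: ubuntu-focal
           - name: gitea99.opendev.org
             label: ubuntu-focal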

Dependencies
============

None