Spec to deploy Prometheus as a Cacti replacement

Our Cacti server is aging and needs to be replaced. Rather than go through a
difficult Cacti upgrade, take the opportunity to replace that system with a
modern one that gives us more potential functionality.

Change-Id: Iee197bc0e8e02007d1fb45464bbadb4c283e96e8

commit cfc6791522
parent ae010afc6f

@@ -66,6 +66,7 @@ section of this index.
    specs/central-auth
    specs/irc
+   specs/prometheus
    specs/storyboard_integration_tests
    specs/storyboard_story_tags
    specs/storyboard_subscription_pub_sub

specs/prometheus.rst · 173 lines · Normal file
@@ -0,0 +1,173 @@

::

 Copyright 2021 Open Infrastructure Foundation

 This work is licensed under a Creative Commons Attribution 3.0
 Unported License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================
Run a Prometheus Service
========================

https://storyboard.openstack.org/#!/story/2009228

Our existing systems metrics tooling is built around Cacti. Unfortunately,
this tooling is aging without a clear path forward. This gives us the
opportunity to reevaluate what tools are best suited to gathering systems
metrics today. Prometheus has grown to become a popular tool in this space,
is well supported, and allows us to gather application metrics for many of
the services we already run in addition to systems metrics. Let's run a
Prometheus instance and start replacing Cacti.

Problem Description
===================

In order to properly size the services we run, debug issues with resource
limits, and generally ensure the health of our systems, we need to collect
metrics on how they are performing. Historically we have done this with
Cacti, which polls systems via SNMP and collects that information in RRD
files. Cacti then renders graphs of this RRD data per host over various
time ranges.

Our Cacti installation is aging and needs to be upgraded. Rather than put a
bunch of effort into maintaining and modernizing this older system, we can
jump directly to Prometheus, which software like Zuul, Gerrit, and Gitea
already support. This change is likely to require a bit more bootstrapping
effort, but in the end we will get a much richer set of metrics for
understanding our systems and software.

Proposed Change
===============

We will deploy a new server with a large attached volume. We will then run
Prometheus with docker-compose, using the prom/prometheus image published to
Docker Hub. The large volume will be mounted to provide storage for
Prometheus' TSDB files.
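
For illustration, a minimal docker-compose service for this could look like
the sketch below. The image tag, mount paths, and retention flag are
placeholder assumptions to be settled during implementation, not decisions
made by this spec.

.. code-block:: yaml

   # Hypothetical docker-compose.yaml sketch; paths and retention are
   # placeholders, not settled values.
   version: '2'
   services:
     prometheus:
       image: docker.io/prom/prometheus:latest
       network_mode: host
       restart: always
       volumes:
         # Prometheus configuration managed on the host.
         - /etc/prometheus:/etc/prometheus
         # The large attached volume, mounted on the host, backing the TSDB.
         - /var/lib/prometheus:/prometheus
       command:
         - --config.file=/etc/prometheus/prometheus.yml
         - --storage.tsdb.path=/prometheus
         - --storage.tsdb.retention.time=365d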

To collect system metrics we will use Prometheus' node-exporter tool. The
upstream for this tool publishes binaries for x86_64 and arm64 systems. We
will use the published binaries (possibly using a local copy) instead of
distro packages because the distro packages are quite old and node-exporter
changed its metric schemas multiple times before reaching version 1.0. We
use the published binaries instead of docker images because running
node-exporter in docker is awkward; you have to expose significant system
resources into the container to properly collect their details. We will need
to open node-exporter's publishing port to the new Prometheus server in our
firewall rules.
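
For illustration only, installing the published binary could be handled with
Ansible tasks along these lines. The release version, architecture suffix,
and install path are assumptions for the sketch, not part of this spec.

.. code-block:: yaml

   # Hypothetical Ansible sketch for installing a node_exporter release
   # binary; version, architecture, and paths are placeholders. A real
   # role would also pick amd64 vs arm64 per host and add a service unit.
   - name: Download node_exporter release tarball
     get_url:
       url: https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
       dest: /tmp/node_exporter.tar.gz

   - name: Install the node_exporter binary
     unarchive:
       src: /tmp/node_exporter.tar.gz
       dest: /usr/local/bin
       remote_src: yes
       extra_opts:
         - --strip-components=1
         - node_exporter-1.2.2.linux-amd64/node_exporter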

Once the base set of services and firewall access are in place, we can begin
to roll out configuration that polls the instances and renders the
information into sets of graphs per instance. Ideally this will be configured
automatically for instances in our inventory, similar to how sslcertcheck
works. None of us are Prometheus experts yet, so we will not prescribe
exactly what those configs should look like here. Instead, we expect
Prometheus configuration to ingest metrics per instance, and Grafana
configuration to render graphs per instance from that data.
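
To make the shape of that configuration concrete, a generated Prometheus
scrape configuration might look like the following sketch. The job name,
scrape interval, and example hostnames are illustrative assumptions only;
9100 is node-exporter's default port.

.. code-block:: yaml

   # Hypothetical prometheus.yml fragment; hostnames and interval are
   # placeholders, not values chosen by this spec.
   global:
     scrape_interval: 60s

   scrape_configs:
     - job_name: node
       static_configs:
         # One target per instance in our inventory, generated by the same
         # automation that writes the rest of our configuration.
         - targets:
             - example01.opendev.org:9100
             - example02.opendev.org:9100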

We can leverage our functional testing system to work out what these configs
should look like, or simply modify them on the new server until we are happy.
We can get away with making these updates "in production" because the new
service won't be in production until we are happy with it.

Once we are happy with the results, we should collect data side by side in
both Cacti and Prometheus for one month. We can then compare the two systems
to ensure the data is accurate and usable. Once we have made this
determination, the old Cacti server can be put into hibernation for
historical record purposes.

Integrating with services like Zuul, Gerrit, Gerritbot, and Gitea is also
possible but outside of the scope of this spec. Adding these integrations
is expected to be straightforward once the Cacti replacement details have
been sorted out.

Alternatives
------------

We can keep running Cacti and upgrade it one way or another. The end result
will be familiar but will provide far less functionality.

We can run Prometheus with its SNMP exporter instead of node exporter. The
upside to this approach is that we already know how to collect SNMP data
from our servers. The downside is that the Prometheus community seems to
prefer node exporter, and there is a bit more tooling around it; we will
probably find better support for Grafana dashboards and graphs that way.
Additionally, node exporter collects for free a lot of information that we
would otherwise have to write our own SNMP MIBs for. This is a good
opportunity to use modern tooling.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  TBD

Gerrit Topic
------------

Use Gerrit topic "opendev-prometheus" for all patches related to this spec.

.. code-block:: bash

   git-review -t opendev-prometheus

Work Items
----------

1. Deploy a new metrics.opendev.org server and update DNS.
2. Deploy Prometheus on the new server with docker-compose.
3. Deploy node exporter on all of our instances.
4. Update firewall rules on all of our instances to allow Prometheus polls
   from the new server.
5. Configure Prometheus to poll our instances.
6. Review the results and iterate until we are collecting what we want to
   collect and it is safe to expose publicly.
7. Open firewall rules on the new server to expose the Prometheus data
   externally.
8. Build Grafana dashboards for our instances exposing the metrics in
   Prometheus (see the datasource sketch after this list).
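
As a starting point for item 8, our Grafana deployment could be pointed at
the new Prometheus service with a provisioned datasource roughly like the
sketch below; the datasource name and URL are placeholder assumptions.

.. code-block:: yaml

   # Hypothetical Grafana datasource provisioning file; the URL assumes the
   # Prometheus web interface is reachable from Grafana and is a placeholder.
   apiVersion: 1
   datasources:
     - name: OpenDev Prometheus
       type: prometheus
       access: proxy
       url: https://metrics.opendev.org:9090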

Repositories
------------

No new repositories will need to be created. All config should live in
opendev/system-config.

Servers
-------

We will create a new metrics.opendev.org server.

DNS Entries
-----------

Only DNS records for the new server will be created.

Documentation
-------------

We will update documentation to include information on operating Prometheus,
adding sources of data to Prometheus, and adding graph dashboards to Grafana
backed by Prometheus.

Security
--------

We will need to update firewall rules on all systems to allow Prometheus
polls from the new metrics.opendev.org server.
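
One possible shape for that change is sketched below as a standalone Ansible
task for illustration; in practice this would likely be folded into our
existing base firewall role, and the variable name and port here are
assumptions.

.. code-block:: yaml

   # Hypothetical task allowing the metrics server to reach node-exporter's
   # default port (9100); metrics_server_ip is a placeholder variable.
   - name: Allow Prometheus polls from metrics.opendev.org
     iptables:
       chain: INPUT
       protocol: tcp
       destination_port: '9100'
       source: "{{ metrics_server_ip }}"
       jump: ACCEPT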

Testing
-------

A system-config-run-prometheus job will be added which deploys Prometheus
alongside at least one other server that it gathers metrics from. This will
ensure that node exporter polling and ingestion into Prometheus are
functional.
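
A rough sketch of what that job definition might look like follows; the
parent job, nodeset, and file matchers are assumptions modeled on other
system-config-run jobs and would need to be confirmed during implementation.

.. code-block:: yaml

   # Hypothetical Zuul job sketch; parent, node labels, and paths are
   # placeholders to be checked against existing system-config jobs.
   - job:
       name: system-config-run-prometheus
       parent: system-config-run
       description: Deploy Prometheus and poll a test node with node-exporter.
       nodeset:
         nodes:
           - name: bridge.openstack.org
             label: ubuntu-focal
           - name: prometheus01.opendev.org
             label: ubuntu-focal
       files:
         - playbooks/service-prometheus.yaml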

Dependencies
============

None