Spec to deploy Prometheus as a Cacti replacement

Our Cacti server is aging and needs to be replaced. Rather than go
through a difficult Cacti upgrade, take the opportunity to replace that
system with a modern one that gives us more potential functionality.

Change-Id: Iee197bc0e8e02007d1fb45464bbadb4c283e96e8
Clark Boylan 2021-08-10 10:09:11 -07:00
parent ae010afc6f
commit cfc6791522
2 changed files with 174 additions and 0 deletions

@@ -66,6 +66,7 @@ section of this index.
   specs/central-auth
   specs/irc
   specs/prometheus
   specs/storyboard_integration_tests
   specs/storyboard_story_tags
   specs/storyboard_subscription_pub_sub

specs/prometheus.rst Normal file

@@ -0,0 +1,173 @@
::

  Copyright 2021 Open Infrastructure Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode

========================
Run a Prometheus Service
========================

https://storyboard.openstack.org/#!/story/2009228

Our existing systems metric tooling is built around Cacti. Unfortunately,
this tooling is aging without a great path forward into the future. This
gives us the opportunity to reevaluate and consider what tools might be
best leveraged for gathering systems metrics today. Prometheus has grown
to become a popular tool in this space, is well supported, and allows us
to gather application metrics for many of the services we already run in
addition to systems metrics. Let's run a Prometheus instance and start
replacing Cacti.

Problem Description
===================

In order to properly size the services we run, debug issues with resource
limits, and generally ensure the health of our systems, we need to collect
metrics on how they are performing. Historically we have done this with
Cacti which polls systems via SNMP and collects that information in RRD
files. Cacti will then render graphs for this RRD data per host over various
time ranges.

Our Cacti installation is aging and needs to be upgraded. Rather than put
a bunch of effort into maintaining this older system and modernizing it,
we can jump directly to Prometheus, which software like Zuul, Gerrit, and
Gitea already supports. This change is likely to require a bit more
bootstrapping effort, but in the end we will get a much richer set of
metrics for understanding our systems and software.

Proposed Change
===============

We will deploy a new server with a large attached volume. We will then run
Prometheus with docker-compose. We should use the prom/prometheus image
published to Docker Hub. The large volume will be mounted to provide storage
for Prometheus' TSDB files.
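
A minimal sketch of what that compose file could look like follows. The host
paths are placeholders rather than decisions made by this spec; the official
image stores its TSDB under /prometheus and reads its configuration from
/etc/prometheus/prometheus.yml.

.. code-block:: yaml

    # Sketch only: host paths and networking choices are illustrative.
    version: '2'
    services:
      prometheus:
        image: docker.io/prom/prometheus
        # Host networking keeps the firewall story simple; publishing port
        # 9090 explicitly would work as well.
        network_mode: host
        restart: always
        volumes:
          # The large attached volume backs the TSDB.
          - /var/lib/prometheus:/prometheus
          # Scrape configuration managed on the host.
          - /etc/prometheus-docker/prometheus.yml:/etc/prometheus/prometheus.yml
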
To collect the system metrics we will use Prometheus' node-exporter tool.
The upstream for this tool publishes binaries for x86_64 and arm64 systems.
We will use the published binaries (possibly using a local copy) instead of
using distro packages because the distro packages are quite old and
node-exporter changed its metric schemas multiple times before it hit
version 1.0. We use the published binaries instead of docker images because
running node-exporter in docker is awkward: you have to expose significant
portions of the host system into the container for it to collect details
properly. We will need
to open up node-exporter's publishing port to the new Prometheus server in our
firewall rules.
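
As an illustration, the per-host deployment could be expressed as Ansible
tasks roughly like the following. The pinned version, paths, and variables
are placeholders, and in practice the firewall rule would be handled by our
existing iptables management rather than a raw task.

.. code-block:: yaml

    # Hypothetical tasks; version, architecture, and addresses are placeholders.
    - name: Download the node_exporter release tarball
      get_url:
        url: "https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-{{ node_exporter_arch }}.tar.gz"
        dest: /tmp/node_exporter.tar.gz

    - name: Install the node_exporter binary to /usr/local/bin
      unarchive:
        src: /tmp/node_exporter.tar.gz
        dest: /usr/local/bin
        remote_src: yes
        extra_opts: ['--strip-components=1', '--wildcards', '*/node_exporter']

    - name: Allow the new Prometheus server to reach node_exporter (port 9100)
      iptables:
        chain: INPUT
        protocol: tcp
        source: "{{ prometheus_server_ip }}"
        destination_port: '9100'
        jump: ACCEPT
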
Once the base set of services and firewall access are in place we can begin
to roll out configuration that polls the instances and renders the
information into sets of graphs per instance. Ideally this will be configured
automatically for instances in our inventory similar to how sslcertcheck
works. At this point I'm not sure any of us are Prometheus experts, so we
will not describe what those configs should look like here. Instead we expect
Prometheus config to ingest metrics per instance, and grafana configs to
render graphs per instance for that data.
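
For example, a minimal scrape configuration might look like the following.
The target list would be generated from our Ansible inventory rather than
written by hand; the hostnames here are placeholders and 9100 is
node-exporter's default port.

.. code-block:: yaml

    # prometheus.yml sketch; targets to be generated from inventory.
    global:
      scrape_interval: 60s

    scrape_configs:
      - job_name: node
        static_configs:
          - targets:
              - host01.opendev.org:9100
              - host02.opendev.org:9100
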
We can leverage our functional testing system to work out what these configs
should look like, or simply modify them on the new server until we are happy.
We can get away with making these updates "in production" because the new
service won't be in production until we are happy with it.

Once we are happy with the results we should collect data side by side in
both Cacti and Prometheus for one month. We can then compare the two systems
to ensure the data is accurate and usable. Once we have made this
determination the old Cacti server can be put into hibernation for historical
record purposes.

Integrating with services like Zuul, Gerrit, Gerritbot, and Gitea is also
possible but outside of the scope of this spec. Adding these integrations
is expected to be straightforward once the Cacti replacement details have
been sorted out.

Alternatives
------------

We can keep running Cacti and upgrade it one way or another. The end result
will be familiar but will provide far less functionality.

We can run Prometheus with its SNMP exporter instead of node exporter. The
upside to this approach is we already know how to collect SNMP data from
our servers. The downside is that the Prometheus community seems to prefer
node exporter and there is a bit more tooling around it. We'll probably find
better support for grafana dashboards and graphs this way. Additionally,
node exporter collects a lot of information for free that we would otherwise
have to write our own SNMP MIBs to gather. This is a good opportunity to use
modern tooling.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  TBD

Gerrit Topic
------------

Use Gerrit topic "opendev-prometheus" for all patches related to this spec.

.. code-block:: bash

    git-review -t opendev-prometheus

Work Items
----------

1. Deploy a new metrics.opendev.org server and update DNS.
2. Deploy prometheus on the new server with docker-compose.
3. Deploy node exporter on all of our instances.
4. Update firewall rules on all of our instances to allow Prometheus polls
   from the new server.
5. Configure Prometheus to poll our instances.
6. Review the results and iterate until we are collecting what we want to
   collect and it is safe to expose publicly.
7. Open firewall rules on the new server to expose the Prometheus data
   externally.
8. Build grafana dashboards for our instances exposing the metrics in
   Prometheus.
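
Item 8 assumes grafana can reach prometheus as a data source. A minimal
sketch of that wiring, using grafana's standard datasource provisioning
format, might look like the following; the URL and the question of which
grafana deployment this lands in are left open by this spec.

.. code-block:: yaml

    # e.g. /etc/grafana/provisioning/datasources/prometheus.yaml (path illustrative)
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://metrics.opendev.org:9090
        isDefault: true

Dashboards can then graph PromQL expressions over node exporter data, for
example ``100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100``
for per-instance CPU utilization.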

Repositories
------------

No new repositories will need to be created. All config should live in
opendev/system-config.

Servers
-------

We will create a new metrics.opendev.org server.

DNS Entries
-----------

Only DNS records for the new server will be created.

Documentation
-------------

We will update documentation to include information on operating prometheus,
adding sources of data to prometheus, and adding graph dashboards to grafana
backed by prometheus.

Security
--------

We will need to update firewall rules on all systems to allow Prometheus
polls from the new metrics.opendev.org server.

Testing
-------

A system-config-run-prometheus job will be added that deploys prometheus
alongside at least one other server for it to gather metrics from. This will
ensure that node exporter polling and ingestion into prometheus is
functional.
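
A rough sketch of that job, assuming it follows the existing
system-config-run-* job family (the parent job, node labels, and playbook
path are assumptions rather than part of this spec):

.. code-block:: yaml

    - job:
        name: system-config-run-prometheus
        parent: system-config-run
        description: >-
          Deploy prometheus plus a node running node-exporter in the test
          environment and verify that metrics are scraped.
        nodeset:
          nodes:
            - name: bridge.openstack.org
              label: ubuntu-focal
            - name: metrics01.opendev.org
              label: ubuntu-focal
        files:
          - playbooks/service-prometheus.yaml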

Dependencies
============

None