::

  Copyright 2018 Red Hat, Inc.

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode

========================
Update Config Management
========================

Puppet 3 has been EOL since December 31, 2016.

We are also increasingly finding ourselves in a position where our
configuration management is a master we serve rather than a force-multiplier
that helps us get things done quickly and more repeatably.

Since our current system was designed, we've also grown capabilities in our
CI infrastructure that, ironically, we're unsuited to fully take advantage
of.

Our CI jobs are now written in Ansible, which means a large number of people
now interface with Ansible on a regular basis, making the switch back to
Puppet a mental context shift.

We tend to deploy a large amount of our software continuously, often either
from our own sources or from a third party, in such a manner that the
traditional value of installing software from the distro channel is not as
great as it once was.

Problem Description
===================

Our current system uses Ansible to orchestrate the running of Puppet. It
has grown organically since it was first put in place back in 2011, and it
is truly amazing it still works as well as it does.

However, with the advent of Ansible-based job content in Zuul v3, we are
regularly writing large amounts of Ansible in service of OpenStack, so the
cognitive shift back to implementing support for services in Puppet feels
onerous. In our current system, we aren't taking advantage of the power
that Zuul's Ansible integration affords us.

We currently have an awkward hybrid system with some information in
Ansible inventory and some information in Puppet hiera, with host group
generation based on a template file.

A majority of the services we deploy end up being compiled from source on
production servers with the help of various development toolchains.

While we could start systematically packaging our services and run a
distribution mirror, doing so incurs a significant overhead we'd prefer to
avoid.

More and more of our systems are multi-node distributed systems, which are
not well served by Puppet.

We do have a very large corpus of existing Puppet, so an all-in-one change
to anything is unreasonable.

Proposed Change
===============

OpenStack Infra should migrate its control plane from Ansible-orchestrated
Puppet to Ansible-orchestrated, container-based deployments. Doing that has
three fundamental pieces:

* Upgrade the existing Puppet to Puppet 4 (and possibly Puppet 5, depending
  on how long other tasks take).
* Config management and orchestration should migrate from Puppet to Ansible.
* Software installation should migrate to container images built in Zuul
  jobs.

Puppet 4
--------

Upgrading to Puppet 4 gives us additional time to deal with the other larger
changes while not feeling pressure due to the Puppet 3 EOL.

We need to complete and enhance the puppet module functional tests so that
they are easier to manage and so they are capable of installing and
validating services with puppet 4.

When we are confident that all the modules needed for a given site.pp node
work with puppet 4, we will upgrade that node to puppet 4. We'll track that
a node should be running puppet 4 by adding a new puppet-4 group in
groups.txt and adding nodes to it one at a time. A new playbook will be
written that runs an idempotent upgrade for those nodes.
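
A minimal sketch of what that idempotent upgrade playbook could look like;
the group name matches groups.txt above, while the script path and version
check are illustrative assumptions:

.. code-block:: yaml

   - hosts: puppet-4
     become: true
     tasks:
       - name: Check the installed puppet version
         command: puppet --version
         register: puppet_version
         changed_when: false

       - name: Upgrade to puppet 4 via the install script
         command: /opt/system-config/install_puppet.sh
         environment:
           PUPPET_VERSION: "4"
         when: not puppet_version.stdout.startswith('4')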

Ansible
-------

We should update to using at least Ansible 2.5 and the OpenStack Inventory
plugin instead of the existing inventory script. Inventory plugins are
stackable, so we should be able to rework the group membership and emergency
file system to be a collection of Inventory plugins instead of a static
generation script.
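
A minimal sketch of an inventory source file for the OpenStack Inventory
plugin under Ansible 2.5; the option values are illustrative:

.. code-block:: yaml

   # openstack.yml, consumed by the stackable "openstack" inventory plugin
   plugin: openstack
   expand_hostvars: true
   fail_on_errors: true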

We should shift away from the ``run_all.sh`` model and move instead to
per-service ansible playbooks. The first version of the per-service
playbooks can simply call puppet as the existing playbook does. As we write
these, we should make an individual cron job for each playbook, since the
playbooks do not depend on each other to run.
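
A minimal sketch of such a first-pass per-service playbook; the host
pattern and the name of the puppet role stand in for whatever the existing
``run_all.sh`` machinery uses today:

.. code-block:: yaml

   # playbooks/service-etherpad.yaml (illustrative)
   - hosts: 'etherpad*:!disabled'
     strategy: free
     roles:
       - puppet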

We should then replace the cron jobs with Zuul jobs in
``openstack-infra/system-config``. Those jobs should use ``add_host`` to
add ``puppetmaster.openstack.org`` with the secret key for ssh in a Zuul
secret. Using a secret and ``add_host`` will ensure the jobs can't be used
by other projects. Since the job will use ``add_host`` for puppetmaster,
the job itself can be nodeless, which should ensure we don't have issues
running deployment jobs during periods of high build traffic.
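
A rough sketch of what such a nodeless deploy job playbook could look like;
the secret, variable and playbook names are illustrative assumptions:

.. code-block:: yaml

   - hosts: localhost
     tasks:
       - name: Write the deployment ssh key from the Zuul secret
         copy:
           content: "{{ puppetmaster_ssh_key.private_key }}"
           dest: "{{ ansible_user_dir }}/.ssh/deploy_key"
           mode: "0600"

       - name: Add puppetmaster to the inventory
         add_host:
           name: puppetmaster.openstack.org
           ansible_user: zuul
           ansible_ssh_private_key_file: "{{ ansible_user_dir }}/.ssh/deploy_key"

   - hosts: puppetmaster.openstack.org
     tasks:
       - name: Run the per-service playbook on puppetmaster
         command: ansible-playbook /opt/system-config/playbooks/service-example.yaml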

We should `add a central ARA instance`_ that ansible is configured to log to
when run on puppetmaster. That way, sysadmins running ansible by hand on
puppetmaster and Zuul-driven jobs will both log to a consistent place. The
ssh key for a zuul user account on the puppetmaster host can be stored in
Zuul as a secret. As a future improvement, enhancing ARA to export reports
from one instance and import them into another could allow us to log
Zuul-driven playbook runs to a per-job ARA as well as have that same report
data go into the central ARA.

We should migrate the ``openstack_project::server`` base puppet pieces to
Ansible roles in the ``openstack-infra/system-config`` repo. This currently
involves creating users, setting the timezone to UTC, setting up rsyslog,
configuring apt to retry and not pull translations, setting up ntp, setting
up the root ssh account for ansible management, setting up snmp, disabling
cloud-init and uninstalling some things we don't need. There are options
for installing the AFS client, managing exim, enabling unbound and managing
iptables rules; those should simply be turned into roles and included in
the playbooks for a given service. Similarly, we install pip/virtualenv in
``openstack_project::server``. We should be able to stop doing that, since
we'll be shifting from installing things directly on the system with pip to
installing them in containers, although we still want to keep it in the
ansible version of ``openstack_project::server`` so that we can transition
our puppet services one at a time.
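
A minimal sketch of the resulting base playbook; the role names are
illustrative placeholders for the pieces listed above:

.. code-block:: yaml

   - hosts: '!disabled'
     become: true
     roles:
       - users
       - timezone
       - rsyslog
       - apt-config
       - ntp
       - snmp
     # Optional pieces (AFS client, exim, unbound, iptables) become
     # standalone roles included by the playbooks of services needing them.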

We should keep the roles in the ``roles`` directory of
``openstack-infra/system-config`` for the time being. While we might want
to split things out into role repositories in the future, there is enough
complication related to an in-place CD transition from puppet to ansible
without over-organizing at the outset.

Once we have per-service playbooks and base server roles, we can begin to
rework the services to be native Ansible one service at a time.

As we work on per-service Ansible, we should shift our secrets from hiera
to an Ansible-inventory-based set of host and group vars. They can continue
to be yaml files in a private git repo; in fact, the current structure may
even work directly. But rather than having ansible copy some variables to
the remote host so that the local puppet apply has access to the hiera
data, once the code being run is actual ansible roles and playbooks we can
just have Ansible use the secrets as variables and stop copying secret
chunks to remote hosts completely. Cleaning up and organizing the existing
hiera data first is likely a good idea.
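
A minimal sketch of how a service's secrets could be laid out as group vars
in the private repo; the file path and variable names are illustrative:

.. code-block:: yaml

   # inventory/group_vars/etherpad.yaml in the private git repo
   etherpad_db_password: not-a-real-secret
   etherpad_session_key: not-a-real-secret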

We may want to investigate using Ansible Vault to store the secrets, with
GPG for encrypting/decrypting them. The GPG private key for zuul can be
stored as a Zuul secret, and we can encrypt things for the union of Zuul
and infra-root. However, this would be more than what we're doing currently
with hiera, so it should be considered a future improvement. If we shift
from using ``add_host`` to add puppetmaster to using the static driver for
puppetmaster, then we may want to consider protecting the secrets with
vault using a GPG key in a Zuul secret. Doing so would be
belt-and-suspenders protection against the node being used in the wrong
context.

On a per-service basis, as we migrate from Puppet to Ansible, we may find
that updating to installing the software via containers at the same time is
more straightforward than breaking it into two steps.

Containers
----------

We should start installing the software for the services we run using
thin per-process containers based on images that we build in the CI system.

We should build and run those containers using Docker. We should install
Docker from the upstream Docker package repository.

Adoption of container technology can happen in phases and in parallel to
the Ansible migration so that we're not biting off too much at one time,
nor blocking progress on a phased approach. There is no need to go
overboard needlessly. If a service doesn't make sense in containers, such
as potentially AFS, we can just run those services as we are running them
now, except using Ansible instead of Puppet. Services like AFS or exim,
where we're installing from distro packages anyway, are less likely to see
a win from bundling the software into containers first. On the other hand,
services where we're installing from source in production, like Zuul, or
building artifacts in CI, like Gerrit (nearly all of our services), are
the most likely to see a win and should be focused on first.

Building container images in CI allows us to decouple essential dependency
versions from underlying distro releases. Where possible, we should prefer
ecosystem-specific base images rather than distro-specific base images. For
instance, we should build container images for each zuul service using the
``python:3.6-slim`` base image with Python 3.6, a container for etherpad
using the ``nodejs`` base image with the correct tag version of node, and a
container for Gerrit with the ``openjdk`` base image. For our Python
services, a new tool, `pbrx`_, is in the works; it has a command for making
single-process containers from pbr setup.cfg files and bindep.txt.

The container images we make should be single-process containers and should
use `dumb-init`_ as an entrypoint so that signals and forking work
properly. This will allow us to start building and using containers of
various pieces of software by changing only the software installation and
init scripts, even while config files, data volumes and the like are still
managed by puppet. Config files and data volumes will be exposed to the
running container via normal bind mounts. Something like:

.. code-block:: console

   docker run -v /etc/zuul:/etc/zuul -v /var/log/zuul:/var/log/zuul zuul/zuul-scheduler

By doing this, we'll still have config files and log files in locations we
expect.

Our services are all currently designed with the assumption that they exist
in a direct internet networking environment. Network namespacing is not a
feature that provides value to us, so we should run docker with
``--network host`` to disable network namespacing.

We also have a need to run local commands, such as ``zuul enqueue``. Those
commands should all exist in the containers, so something like:

.. code-block:: console

   docker run -it --rm zuul/zuul -- enqueue

would do the trick, but is a bit unwieldy. We should add wrapper scripts to
the surrounding host that allow us to run utility scripts from the services
as needed, so that a ``/usr/local/bin/zuul`` script would be:

.. code-block:: console

   docker run -it --rm zuul/zuul -- "$@"

Generating those scripts could be a utility that we add to `pbrx`_, or it
could be an Ansible role we write.
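
If we go the Ansible route, a minimal sketch of such a task; the wrapper
name and image are illustrative:

.. code-block:: yaml

   - name: Install container CLI wrapper for zuul
     copy:
       dest: /usr/local/bin/zuul
       mode: "0755"
       content: |
         #!/bin/bash
         exec docker run -it --rm zuul/zuul -- "$@"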

Alternatives
------------

- Stay on puppet 3 forever
- Stay on puppet forever but upgrade it
- Migrate to ansible but without containers
- Build distro packages of our software
- Use software other than docker for containers

There are alternative tools for building and running containers that can be
explored. To keep initial adoption simple, starting with Docker seems like
the best bet, but alternate technology can be explored as a follow-on. The
Docker daemon is a piece of operational complexity that is not required to
use Linux namespaces.

For building images we (or `pbrx`_) can use Ansible playbooks or `img`_ or
`buildah`_ or `s2i`_. For running containers we can look at `rkt`_ or
`podman`_. Since `rkt`_ and `podman`_ follow a traditional fork/exec model
rather than having a daemon, we'd want to use systemd to ensure services
run on boot or are restarted appropriately. As we start working, it may
turn out that transitioning from systemd-process to
systemd-podman-container is easier than transitioning from systemd-process
to docker-container and then to systemd-podman-container.

If, in the future, we deploy a Container Orchestration Engine such as
Kubernetes, we should consider running it with `cri-o`_ to avoid the Docker
daemon on the backend.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  mordred
  Colleen Murphy <colleen@gazlene.net>
  infra-root

mordred can help get it going, but in order for it to be successful, we'll
all need to be involved. Colleen has already done much of the Puppet 4
work.

Gerrit Topic
------------

Use Gerrit topic "update-cfg-mgmt" for all patches related to this spec.

.. code-block:: bash

   git-review -t update-cfg-mgmt

Work Items
----------

Puppet
~~~~~~

#. Complete and enhance puppet module functional tests.

   #. We need to ensure all modules have proper functional tests that at
      least perform a basic smoke test.

   #. The functional tests need to accept a puppet version parameter (see
      the job sketch after this list).

   #. An experimental functional test job needs to be added that uses the
      puppet version parameter. The job should be graduated to non-voting
      and then to gating.

#. Audit all upstream modules in modules.env for version compatibility and
   take steps to upgrade to a cross-compatible version if necessary.

#. Turn on the future parser in puppet.conf on all nodes in production. The
   future parser will start interpreting manifests with puppet 4 syntax
   without actually having to run the upgrade yet.

#. Enhance the install_puppet.sh script for puppet 4.

   #. The script already installs puppet 4 when PUPPET_VERSION=4 is set.
      Since this script is currently only run during launch-node.py and not
      periodically, we do not need to worry about the script accidentally
      downgrading puppet at some point after the upgrade. However, in the
      event things go wrong and we want to revert a node back to puppet 3,
      we need to be able to manually run the script again to forcefully
      downgrade, so we most likely need to enhance the script to ensure
      this works properly.

#. Write ansible logic to mark nodes that should be running puppet 4 and
   run the upgrade.

   #. The playbook will need to run the install_puppet.sh script with
      PUPPET_VERSION=4.
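
A minimal sketch of the version-parameterized functional job mentioned in
the list above; the job and variable names are illustrative assumptions:

.. code-block:: yaml

   - job:
       name: puppet-module-functional-puppet4
       parent: puppet-module-functional
       # Experimental/non-voting while we gain confidence, then gating.
       voting: false
       vars:
         puppet_version: 4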

Ansible
~~~~~~~

#. Split run_all.yaml into service-specific playbooks.
#. Rewrite ``openstack_project::server`` in Ansible (infra.server).
#. Add a playbook targeting ``hosts: all`` that runs infra.server.
#. Either add "install docker" to infra.server or make an ansible hostgroup
   that contains it.
#. Either rewrite launch_node.py to bootstrap using infra.server or ensure
   we can use ansible-cloud-launcher instead.
#. Install a local container registry service as our first docker-based
   service.

On a service-by-service basis:

#. Add a Zuul job to build the software into container(s) and publish the
   containers into our local container registry (and to dockerhub); see the
   job sketch after this list.
#. Translate the puppet for the service into ansible that runs the software
   from the container.
#. Add a Zuul job that runs the new ansible with the container for testing.
#. Change the service's playbook to use the ansible/container deployment.
#. Retire the service's puppet.
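
A rough sketch of the image build job from step one; the job name, playbook
path and secret name are illustrative assumptions:

.. code-block:: yaml

   - job:
       name: system-config-build-image-example
       description: Build the example service container image and publish it.
       run: playbooks/zuul/build-image-example.yaml
       # Registry credentials supplied via a Zuul secret.
       secrets:
         - registry_credentials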

Repositories
------------

We may need to create some repositories as a place to put
jobs/roles/Dockerfiles for services where we aren't already tracking a git
repo locally. For instance, etherpad doesn't have a great place for us to
put things.

When we're done, we'll have a LOT of ``puppet-.*`` repos that we will no
longer care about. We should soft-retire them, leaving them in place for
people still depending on them.

Servers
-------

All existing servers will be affected.

DNS Entries
-----------

No new DNS entries are explicitly required.

Documentation
-------------

All of the Infra system config documentation on how to manage puppet things
will need to be rewritten. OpenStack developers should not notice any
changes in their daily workflow.

Security
--------

This change should help improve security, since we'll once again be getting
security updates to Puppet.

We are not proposing using containers for any increased isolation at this
point, other than as a build step and a convenient software installation
vehicle. However, building container images in CI and then deploying them
means we will need to track software manifests of the built images so that
we can know when we need to trigger a container rebuild due to CVEs.

We should make sure we have a mechanism to trigger a rebuild / redeploy.

We could **also** periodically rebuild / redeploy our service containers
just in case we miss a CVE somewhere.
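
A rough sketch of how that periodic rebuild could be wired up in Zuul,
reusing the illustrative image build job from the work items:

.. code-block:: yaml

   # Rebuild images regularly even without code changes.
   - project:
       periodic:
         jobs:
           - system-config-build-image-example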

The playbooks will be run by Zuul on puppetmaster using the secrets system
to protect the private ssh key. Normal infra-core reviews in system-config
should be sufficient to protect this.

Testing
-------

Puppet 4 is already validated with the puppet-apply noop tests, and this
spec proposes enhancing the module functional tests before proceeding with
the upgrade.

As we shift to Ansible, the functional tests for puppet need to be shifted
as well. We should use `testinfra`_ for our Ansible testing.

We're currently using `serverspec`_ with our Puppet. However, `serverspec`_
is Ruby-based, which is additional context for admins to deal with, and
carrying that additional context through a shift to Ansible seems less
desirable.

`testinfra`_ is python-based, so it fits with the larger majority of our
ecosystem, but it will require us to write all new tests. It has ansible,
docker and kubectl backends, so it should allow us to plug in to things
where we'd like to. It is implemented as a **py.test** plugin, which has a
different test-writing paradigm than we are used to with **testtools**, but
the context shift there is still likely less than the Python-to-Ruby
context shift.

On a per-service basis, as we transition a service from Puppet to Ansible,
we should structure the deploy playbooks such that they can be run in Zuul.
We should then make jobs that use those playbooks against nodes in the jobs
and then run `testinfra`_ tests to validate the playbooks did the right
thing.

Since the rest of our testing is **subunit** based, we may want to pick up
the work on `pytest-subunit`_.

Dependencies
============

Concurrent with `add a central ARA instance`_.

Docker will need to be installed. As proposed above, we intend to install
it from the upstream Docker package repository rather than relying on the
distro-supplied Docker.

We'll need to run a container registry into which we can publish our
container images so that we are not dependent on hub.docker.com to update
our systems. We should still publish our containers to hub.docker.com as
well.

References
==========

- `Puppet Inc Upgrade Announcement <https://docs.puppet.com/upgrade>`_
- `Puppet 4 Release notes <https://docs.puppet.com/puppet/4.0/release_notes.html>`_
- `Features of the Puppet 4 Language <https://www.devco.net/archives/2015/07/31/shiny-new-things-in-puppet-4.php>`_

.. _`add a central ARA instance`: https://review.openstack.org/527500/
.. _`Puppet 4 Preliminary Testing spec`: https://docs.opendev.org/opendev/infra-specs/latest/specs/puppet_4_prelim_testing.html
.. _dumb-init: https://github.com/Yelp/dumb-init
.. _openshift-ansible: https://github.com/openshift/openshift-ansible
.. _oc cluster up: https://github.com/openshift/origin/blob/master/docs/cluster_up_down.md
.. _cri-o: http://cri-o.io/
.. _pbrx: http://git.openstack.org/cgit/openstack/pbrx
.. _img: https://github.com/genuinetools/img
.. _buildah: https://github.com/projectatomic/buildah
.. _s2i: https://github.com/openshift/source-to-image
.. _rkt: https://coreos.com/rkt/
.. _podman: https://github.com/projectatomic/libpod
.. _testinfra: https://testinfra.readthedocs.io/en/latest/
.. _serverspec: https://serverspec.org/
.. _pytest-subunit: https://github.com/lukaszo/pytest-subunit