docs: reorganise around an open infrastructure overview

This introduces an "Open Infrastructure" page, which is designed to give
a moderately experienced developer with some understanding of Zuul,
Ansible and basic Linux admin skills an entrypoint to navigating the
system-config and related repositories. It is designed to reinforce the
idea of open infrastructure, and to explain how development, testing and
production come together at a level high enough to be understood, but
with links to or descriptions of specific places in the code to get
started.

It moves a little of what was in the sysadmin page into this, and leaves
that page as more low-level descriptions of various tasks.

Change-Id: I60a9299df455b98ad549ac0075a59d381722bc06
parent 3f6cd427d7
commit 4c86706e5e
@ -183,9 +183,8 @@ After the cloud is configured, it can be added as a resource for
 nodepool to use for testing nodes.
 
 Firstly, an ``infra-root`` member will need to make the region-local
-mirror server, configure any required storage for it and setup DNS
-(see :ref:`adding_new_server`). With this active, the cloud is ready
-to start running testing nodes.
+mirror server, configure any required storage for it and setup DNS.
+With this active, the cloud is ready to start running testing nodes.
 
 At this point, the cloud needs to be added to nodepool configuration
 in `project-config
@ -34,6 +34,7 @@ Contents:
 :maxdepth: 2
 
 project
+open-infrastructure
 test-infra-requirements
 sysadmin
 systems
301
doc/source/open-infrastructure.rst
Normal file
@ -0,0 +1,301 @@
:title: Open Infrastructure Technical Overview

.. _opendev-infra-overview:

Open Infrastructure Technical Overview
######################################

The OpenDev system administration team strives to run the services
behind the OpenDev Collaboratory as an open source project; we term
this *open infrastructure*.

Our infrastructure is code, and contributions to it are handled just
like the rest of OpenDev. This means that anyone can contribute to
the installation and long-running maintenance of systems without shell
access, and anyone who is interested can provide feedback and
collaborate on code reviews. No permissions or special privileges are
required to contribute to the OpenDev infrastructure project.

Below is a short guide to the major pieces of the project. Some
knowledge of Zuul job configuration, Ansible, interaction with the
Gerrit code-review system and general Linux administration is
assumed; however, expertise is not required.

Operating environment
---------------------

The OpenDev production systems run on resources (compute, network,
storage) donated by companies who support the project.

Our standard production system is based on the latest Ubuntu LTS
release.

Production systems are deployed by Ansible. Most production
applications run from containers; some are custom built and others we
use unmodified from upstream sources.

Zuul handles the testing and deployment of all changes. Current
trends would call this a *gitops* model: every production change is
ultimately driven by a change proposed to the code-review system. This
means we do not have bespoke production systems, and any modifications
we make are reviewed by peers and logged with change history.

We have a *bastion host*, or *bridge*, which is a static host with
permissions to deploy to the production systems. Zuul runs Ansible on
the production systems via this host to deploy new changes into
production.

Getting started - CI
--------------------

The configuration of every system operated by the OpenDev sysadmins is
managed by Ansible and driven by continuous integration and deployment
by Zuul. This is done almost exclusively with code kept in the
``system-config`` repository, which can be browsed at:

https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that
anyone may propose a change in the running configuration to Gerrit.

Any change to the OpenDev infrastructure is first proposed as a review
to this repository at ``review.opendev.org``. The current open reviews
can be seen at:

https://review.opendev.org/q/project:opendev/system-config

Zuul first runs CI on all incoming changes. Each service generally has
its own CI job that runs when relevant files (configuration, Ansible
roles, playbooks, etc.) are updated. These jobs are generally called
``system-config-run-<service>``. Zuul will post a comment when the
change has been tested, or you can follow in-flight testing on the
status page:

https://zuul.opendev.org/t/openstack/status

These jobs are crafted to replicate production as closely as possible.
Reading the job definitions in :git_file:`zuul.d/system-config-run.yaml`
will give you a feel for the hosts that are set up by each job. When
you view the job results in the Zuul UI, you will see many logs
collected from a number of hosts that simulate the production
environment. These contain all the information you generally need to
debug problems, but the best place to start is the *artifacts* tab,
which has some curated links to useful overviews.

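For illustration only, a ``system-config-run-<service>`` job for a
hypothetical ``example`` service might look roughly like the sketch
below. The parent job name, node names, node label and file list are
assumptions made for this sketch, not copied from
:git_file:`zuul.d/system-config-run.yaml`. What it aims to show is that
the test nodes are named like production hosts, and that the ``files``
matchers limit the job to changes that touch the service.

.. code-block:: yaml

   # Hypothetical CI job for a made-up "example" service.
   - job:
       name: system-config-run-example
       # Assumed name of a shared base job that sets up the test bridge.
       parent: system-config-run
       description: Run the example service playbook against test nodes.
       nodeset:
         nodes:
           # Test node names mirror production-style host names so the
           # inventory groups match them as they would in production.
           - name: bridge99.opendev.org
             label: ubuntu-jammy
           - name: example99.opendev.org
             label: ubuntu-jammy
       # Zuul treats these entries as regular expressions matched against
       # the paths of files modified by the change.
       files:
         - playbooks/service-example.yaml
         - playbooks/roles/example/
         - testinfra/test_example.py
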
One of the job artifacts is the `ARA report
<https://ara.readthedocs.io/en/latest/>`__. This is a graphical view
of the *nested* Ansible run on the (ephemeral) bastion host against
the (ephemeral) production-test nodes. This is generally the first
stop for finding deployment issues.

Another artifact is the ``testinfra results``. `Testinfra
<https://testinfra.readthedocs.io/en/latest/>`__ allows us to define
unit-test-like checks of functionality such as service and API
status, correct deployment of users and files, and other interesting
details. Failures here would indicate that the deployment steps
worked, but some part of the operation of that system is not as we
expect. The ``testinfra`` code driving this is kept in
:git_file:`testinfra`, and test files are named for the service they
test.

Finally, there is a ``screenshots`` artifact, which links to a
directory that some tests populate with image files. Tests that bring
up interactive services use a headless browser to take screenshots of
important pages to verify correct operation.

The *logs* tab has links to the raw logs; these collect much more
detail, such as ``syslog``, Apache logs, database dumps, etc. Once you
have identified the general problem from the above steps, these logs
provide the in-depth details for further analysis.

Playbooks and roles
-------------------

The starting point for all services is generally the playbooks and
roles kept in :git_file:`playbooks/`. Most playbooks are named
``service-<name>.yaml``, and their names indicate which production
areas they drive.

During testing, these same playbooks are run against the test nodes.
Note that the testing hosts are given names that match the group
configuration in the jobs defined in
:git_file:`zuul.d/system-config-run.yaml`.

These playbooks are usually small and call out to roles, where most of
the work is done. Roles are kept in :git_file:`playbooks/roles/`.
They are written to be as generic as possible, but they are not
expected to be used outside the OpenDev production deployment system.

The same playbooks and roles are used for both CI and deployment.

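A minimal sketch of this split, assuming a hypothetical ``example``
group and role: the playbook only selects hosts and applies the role,
while the role's tasks do the actual work. The paths in the comments,
the configuration directory and the template name are illustrative
only; in practice the two YAML documents would live in separate files.

.. code-block:: yaml

   # playbooks/service-example.yaml (hypothetical): select the hosts in
   # the "example" group and hand the work to the "example" role.
   - hosts: example
     name: "Configure the example service"
     roles:
       - example

   ---
   # playbooks/roles/example/tasks/main.yaml (hypothetical): the role
   # does the real configuration work.
   - name: Ensure the configuration directory exists
     ansible.builtin.file:
       path: /etc/example
       state: directory
       mode: "0755"

   - name: Install the service configuration file
     ansible.builtin.template:
       src: example.conf.j2
       dest: /etc/example/example.conf
       mode: "0644"
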
Hosts and variables
-------------------

The playbooks above run on groups of hosts which are defined in
:git_file:`inventory/service/groups.yaml`.

The production hosts are kept in an inventory at
:git_file:`inventory/base/hosts.yaml`. In CI, the inventory is
generated by Zuul, since it allocates ephemeral nodes from the
testing pool.

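A sketch of what adding a hypothetical ``example`` service might
involve in these two files. It assumes the groups file maps group
names to host-name patterns, and that production hosts sit under a
standard Ansible ``all.hosts`` mapping; the real files carry more
structure and fields, so copy existing entries rather than this
example. The address is from the 203.0.113.0/24 documentation range.

.. code-block:: yaml

   # inventory/service/groups.yaml (hypothetical excerpt): the group a
   # service playbook targets, matched by host-name pattern.
   groups:
     example: example[0-9]*.opendev.org

   ---
   # inventory/base/hosts.yaml (hypothetical excerpt): the production
   # host that the pattern above would match.
   all:
     hosts:
       example01.opendev.org:
         ansible_host: 203.0.113.10  # documentation-range address
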
Public production and testing variables are kept under
:git_file:`inventory/`. The one difference between CI and production
is *secrets* such as API keys, tokens and passwords. In production, the
*nested* Ansible populates these variables for the deployment directly
from values stored on the bastion host. In CI, dummy values should be
placed in the templates under :git_file:`playbooks/zuul/templates/`.

Production secrets are currently managed manually by OpenDev
administrators on the bastion host.

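As a sketch of the CI side, a template under
:git_file:`playbooks/zuul/templates/` could supply harmless placeholder
values under the same variable names the roles read in production. The
file location and variable names below are hypothetical.

.. code-block:: yaml

   # Hypothetical template, e.g. a group_vars file for an "example" group
   # under playbooks/zuul/templates/. These are dummy values for CI only;
   # the real secrets live on the bastion host and are never committed.
   example_db_password: not-a-real-password
   example_api_token: dummy-token-for-ci
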
Deployment
----------

After review and approval of a change, Zuul will perform final gate
testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the
*check* pipeline, and approving a change triggers Zuul to run gate
tests and merge in the *gate* pipeline, the merge of a change triggers
Zuul to run the deployment jobs in the *deploy* pipeline.

These jobs are named ``infra-prod-<service>`` and run the same
playbooks and roles as the CI system, except against the production
services. Zuul deploys the merged changes to the bastion host, and
then triggers the bastion host to run a *nested* Ansible deployment
against the production hosts.

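A rough sketch of how such a deployment job could be wired up for a
hypothetical ``example`` service. The parent job name
``infra-prod-playbook`` and its ``playbook_name`` variable are
assumptions for illustration; the ``project`` stanza shows the general
Zuul mechanism of attaching a job to the *deploy* pipeline.

.. code-block:: yaml

   # Hypothetical deployment job for the "example" service.
   - job:
       name: infra-prod-example
       # Assumed shared parent that runs a named playbook via the bastion.
       parent: infra-prod-playbook
       vars:
         playbook_name: service-example.yaml
       # Only deploy when files relevant to this service change.
       files:
         - inventory/service/groups.yaml
         - playbooks/service-example.yaml
         - playbooks/roles/example/

   # Run the job in the deploy pipeline after changes merge.
   - project:
       deploy:
         jobs:
           - infra-prod-example
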
Since the production run logs may leak sensitive information, they are
not published openly. To read them, add your GPG public key to
:git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and
ensure the ``infra-prod-<service>`` production job has your name in
its ``encrypt_logs_job_recipients`` variable. Once this is approved and
merged, you will be able to view the encrypted production log output
provided via the Zuul build page for the production run.

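As a sketch of the job side of this, assuming the variable is a list of
names that must match the key entries in the role defaults. The job
name and recipient name are hypothetical; copy the format of the key
entry itself from the existing entries in the defaults file.

.. code-block:: yaml

   - job:
       name: infra-prod-example   # hypothetical production job
       vars:
         # Assumed to be a list of names; each must match a key entry
         # defined in the encrypt-logs role defaults.
         encrypt_logs_job_recipients:
           - your_name
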
Containers
----------

Most services are containerised. When looking at the
``system-config-run-*`` and ``infra-prod-*`` jobs, you may see
dependencies on container build/upload/promote jobs; these indicate
that we build a bespoke container for this environment.

The base ``Dockerfile`` for these containers is found under
:git_file:`docker/`. Most are straightforward, but some of the more
complicated services have multiple steps and layers. Any changes to
the ``Dockerfile`` will be tested as usual, and when approved the
containers will be rebuilt, published and pulled onto the production
systems automatically.

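For illustration, an image build job for a hypothetical ``example``
service could point the build at its directory under
:git_file:`docker/`. The ``docker_images`` list with ``context`` and
``repository`` keys follows the convention of the container image
roles in ``zuul-jobs``; the job name, parent job and image repository
name here are assumptions.

.. code-block:: yaml

   - job:
       name: system-config-build-image-example   # hypothetical
       parent: system-config-build-image          # assumed base job
       vars:
         docker_images:
           # Build docker/example/Dockerfile and tag the result for
           # this (hypothetical) image repository.
           - context: docker/example
             repository: opendevorg/example
       # Rebuild only when the image's build context changes.
       files:
         - docker/example/
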
Certificates
------------

We provision SSL certificates from Let's Encrypt; see
:ref:`letsencrypt`.

DNS
---

DNS for ``opendev.org`` (and some other domains) is also managed
through the code-review system; see the
`zone-opendev.org <https://opendev.org/opendev/zone-opendev.org/>`__
project.

Backups
-------

Any host in the ``backup`` group will have backups to two
geographically distinct locations set up by the deployment
infrastructure. See the ``borg-backup`` role for details on including
or excluding various data.

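Opting a host into backups is therefore an inventory change. A sketch,
assuming group members can be listed by host name in
:git_file:`inventory/service/groups.yaml`; the host name is
hypothetical, and the ``borg-backup`` role variables that control which
paths are included are not reproduced here.

.. code-block:: yaml

   # Hypothetical excerpt of inventory/service/groups.yaml: membership
   # of the backup group opts this host into the backup setup.
   groups:
     backup:
       - example01.opendev.org
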
Remote access
-------------

Hosts are only configured by Ansible, but they can be set up for
interactive access if required.

Add your public key to :git_file:`inventory/base/group_vars/all.yaml`
and include a stanza like this in your server ``host_vars``::

  extra_users:
    - your_user_name

See :ref:`ssh-access` for details on keys.

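As a sketch of the key half of this, an entry in
:git_file:`inventory/base/group_vars/all.yaml` might look like the
following. The ``all_users`` mapping and its field names are assumed
here, and the user name, key and IDs are placeholders; copy the
structure of an existing entry in that file rather than relying on this
example.

.. code-block:: yaml

   # Hypothetical excerpt of inventory/base/group_vars/all.yaml.
   all_users:
     your_user_name:
       comment: Your Name
       # Placeholder: paste your real public key here.
       key: ssh-ed25519 AAAA...replace-with-your-public-key you@example.org
       # Placeholder IDs; pick values that do not clash with existing users.
       uid: 2099
       gid: 2099
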
Documentation
-------------

Each service should have an RST file in :git_file:`doc/source/` with
documentation about the server and services.

Submitting Changes
------------------

If you are not familiar with submitting changes to Gerrit, you can
start with any of the various developer guides, such as:

* https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
* https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
* https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and is the major source of
historical information. A developer should be able to read the
description of a change and have the context to generally understand
why it was introduced. Comments in the code-review system are useful
for understanding the deeper history of each change, but each change
should stand alone once committed. Only the most trivial changes that
are completely self-evident (e.g. typo fixes) would be expected to
have fewer than a few sentences of context in their change log.

Lifecycle
---------

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require
resources, you should do some preparation work. Putting an item on
the `weekly team meeting agenda
<https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting>`__ is
always welcome. Logs of previous meetings can be seen at
`<https://meetings.opendev.org/#OpenDev_Meeting>`__. More complicated
changes may justify going through the spec process; see
`<https://opendev.org/opendev/infra-specs>`__. If the existing admins
are aware of the details before reviews start appearing, the process
is much smoother.

All preliminary work can be done iteratively, at your own pace, using
the CI jobs. The ``#opendev`` IRC channel on ``OFTC`` is a good place
to find help during this process. Alternatively, questions are welcome
on the `service-discuss list
<http://lists.opendev.org/cgi-bin/mailman/listinfo/service-discuss>`__.

Your change (or changes) will be reviewed and may take a few rounds
before final approval (in Gerrit terms, a ``+2`` vote). Most changes
will receive a few ``-1`` votes from reviewers during development.
This is really just a flag to note that some further discussion is
required; it is not a rejection.

While you are still working on a change, you can set ``Workflow`` to
``-1`` in Gerrit, or put ``[WIP]`` at the front of the change
description, to indicate to reviewers that they probably shouldn't
spend much time on it yet. Small, stand-alone sequential changes are
encouraged, and Zuul makes testing such "stacks" of changes trivial.

Currently, admins manually deploy production virtual machines, the
storage attached to those machines, and secrets on the bastion host.
This needs to happen before changes can be put into production.
Discussion with the admins will help decide which cloud provider to
use, the VM size and storage, and other such matters.

Once resources are allocated and the new host is available in the
inventory, the production jobs can deploy it. After this, the service
moves into a maintenance phase; changes can be proposed and, after
review, deployed.

@ -1,89 +1,15 @@
 :title: System Administration
 
+This page collects technical information of relevance to those
+interested in admin of OpenDev services. For a higher-level overview,
+see :ref:`opendev-infra-overview`.
+
 .. _sysadmin:
 
 System Administration
 #####################
 
-Our infrastructure is code and contributions to it are handled just
-like the rest of OpenDev. This means that anyone can contribute to
-the installation and long-running maintenance of systems without shell
-access, and anyone who is interested can provide feedback and
-collaborate on code reviews.
-
-The configuration of every system operated by the infrastructure team
-is managed by Ansible and driven by continuous integration and
-deployment by Zuul.
-
-https://opendev.org/opendev/system-config
-
-All system configuration should be encoded in that repository so that
-anyone may propose a change in the running configuration to Gerrit.
-
-Guide to CI and CD
-==================
-
-All development work is based around Zuul jobs and a continuous
-integration and development workflow.
-
-The starting point for all services is generally the playbooks and
-roles kept in :git_file:`playbooks`.
-Most playbooks are named ``service-<name>.yaml`` and will indicate
-which production areas they drive.
-
-These playbooks run on groups of hosts which are defined in
-:git_file:`inventory/service/groups.yaml`. The production hosts are kept
-in an inventory at :git_file:`inventory/base/hosts.yaml`. During
-testing, these same playbooks are run against the test nodes. You can
-note that the testing hosts are given names that match the group
-configuration in the jobs defined in
-:git_file:`zuul.d/system-config-run.yaml`.
-
-Deployment is run through a bastion host ``bridge.openstack.org``.
-After changes are approved, Zuul will run Ansible on this host; which
-will then connect to the production hosts and run the orchestration
-using the latest committed code. The bridge is a special host because
-it holds production secrets, such as passwords or API keys, and
-unredacted logs. As many logs as possible are provided in the public
-Zuul job results, but they need to be audited to ensure they do not
-leak secrets and thus in some cases may not be published.
-
-For CI testing, each job creates a "fake" bridge, along with the
-servers required for orchestration. Thus CI testing is performed by a
-"nested" Ansible -- Zuul initially connects to the testing bridge node
-and deploys it, and then this node runs its own Ansible that tests the
-orchestration to the other testing nodes, simulating the production
-environment. This is driven by playbooks kept in
-:git_file:`playbooks/zuul`. Here you will also find testing
-definitions of host variables that are kept secret for production
-hosts.
-
-After the test environment is orchestrated, the
-`testinfra <https://testinfra.readthedocs.io/en/latest/>`__ tests from
-:git_file:`testinfra` are run. This validates the complete
-orchestration testing environment; things such as ensuring user
-creation, container readiness and service wellness checks are all
-performed.
-
-.. _adding_new_server:
-
-Adding a New Server
-===================
-
-Creating a new server for your service requires discussion with the
-OpenDev administrators to ensure donor resources are being used
-effectively.
-
-* Hosts should only be configured by Ansible. Nonetheless, in some
-cases SSH access can be granted. Add your public key to
-:git_file:`inventory/base/group_vars/all.yaml` and include a stanza
-like this in your server ``host_vars``::
-
-extra_users:
-- your_user_name
-
-* Add an RST file with documentation about the server and services in
-:git_file:`doc/source` and add it to the index in that directory.
+.. _ssh-access:
 
 SSH Access
 ==========