From 4c86706e5ea47074a4dd166c7eefc6b77699fbbf Mon Sep 17 00:00:00 2001
From: Ian Wienand
Date: Thu, 24 Feb 2022 19:36:48 +1100
Subject: [PATCH] docs: reorganise around an open infrastructure overview

This introduces an "Open Infrastructure" page which is designed for a
moderately experienced developer with some understanding of Zuul,
Ansible and basic Linux admin skills to have an entry point for
navigating the system-config and related repositories.

It is designed to reinforce the idea of open infrastructure, and to
explain how development, testing and production come together at a
level high enough to be understood, but with links to or descriptions
of specific places in the code to get started.

It moves a little of what was in the sysadmin page into this, and
leaves that page as more low-level descriptions of various tasks.

Change-Id: I60a9299df455b98ad549ac0075a59d381722bc06
---
 doc/source/contribute-cloud.rst    |   5 +-
 doc/source/index.rst               |   1 +
 doc/source/open-infrastructure.rst | 301 +++++++++++++++++++++++++++++
 doc/source/sysadmin.rst            |  84 +-------
 4 files changed, 309 insertions(+), 82 deletions(-)
 create mode 100644 doc/source/open-infrastructure.rst

diff --git a/doc/source/contribute-cloud.rst b/doc/source/contribute-cloud.rst
index 9e2d04acc1..d7eb6f29c6 100644
--- a/doc/source/contribute-cloud.rst
+++ b/doc/source/contribute-cloud.rst
@@ -183,9 +183,8 @@
 After the cloud is configured, it can be added as a resource for
 nodepool to use for testing nodes.

 Firstly, an ``infra-root`` member will need to make the region-local
-mirror server, configure any required storage for it and setup DNS
-(see :ref:`adding_new_server`).  With this active, the cloud is ready
-to start running testing nodes.
+mirror server, configure any required storage for it and set up DNS.
+With this active, the cloud is ready to start running testing nodes.

 At this point, the cloud needs to be added to nodepool configuration
 in `project-config
diff --git a/doc/source/index.rst b/doc/source/index.rst
index b248d84c4e..e2f3bb56f2 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -34,6 +34,7 @@ Contents:
    :maxdepth: 2

    project
+   open-infrastructure
    test-infra-requirements
    sysadmin
    systems
diff --git a/doc/source/open-infrastructure.rst b/doc/source/open-infrastructure.rst
new file mode 100644
index 0000000000..e86750b5f1
--- /dev/null
+++ b/doc/source/open-infrastructure.rst
@@ -0,0 +1,301 @@
:title: Open Infrastructure Technical Overview

.. _opendev-infra-overview:

Open Infrastructure Technical Overview
######################################

The OpenDev system administration team strives to run the services
behind the OpenDev Collaboratory as an open source project; we term
this *open infrastructure*.

Our infrastructure is code and contributions to it are handled just
like the rest of OpenDev.  This means that anyone can contribute to
the installation and long-running maintenance of systems without shell
access, and anyone who is interested can provide feedback and
collaborate on code reviews.  There are no permissions or special
privileges required to contribute to the OpenDev infrastructure
project.

Below is a short guide to the major pieces of the project.  Some
knowledge of Zuul job configuration, Ansible, interaction with the
Gerrit code-review system and general Linux administration is
assumed; however, expertise is not required.
Operating environment
---------------------

The OpenDev production systems run on resources (compute, network,
storage) donated by companies that support the project.

Our standard production system is based on the latest Ubuntu LTS
release.

Production systems are deployed by Ansible.  Most production
applications run from containers; some are custom built, while others
are used unmodified from upstream sources.

Zuul handles the testing and deployment of all changes.  Current
trends would refer to this as a *gitops* model -- all production
changes are ultimately driven by a change proposed to the code-review
system.  This means we do not have bespoke production systems, and any
modifications we make are reviewed by peers and logged with change
history.

We have a *bastion host*, or *bridge*, which is a static host with
permissions to deploy to the production systems.  Zuul runs Ansible on
the production systems via this host to deploy new changes into
production.

Getting started - CI
--------------------

The configuration of every system operated by the OpenDev sysadmins is
managed by Ansible and driven by continuous integration and deployment
with Zuul.  This is almost exclusively handled by code kept in the
``system-config`` repository, which can be browsed at:

  https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that
anyone may propose a change to the running configuration via Gerrit.

Any change to the OpenDev infrastructure is first proposed as a
review to this repository at ``review.opendev.org``.  The current open
reviews can be seen at

  https://review.opendev.org/q/project:opendev/system-config

Zuul will first run CI on all incoming changes.  Each service
generally has its own CI job that runs when relevant files
(configuration, Ansible roles, playbooks, etc.) are updated.  These
jobs are generally called ``system-config-run-<service>``; Zuul will
post a comment when the change has been tested, or you can see
in-flight testing at the status page

  https://zuul.opendev.org/t/openstack/status
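As a rough illustration, such a job definition has the following
shape.  This is a minimal sketch only -- the service name, node
labels and file triggers here are hypothetical; see
:git_file:`zuul.d/system-config-run.yaml` for the real definitions::

   - job:
       name: system-config-run-example
       parent: system-config-run
       description: Run the service-example.yaml playbook on test nodes.
       nodeset:
         nodes:
           # an ephemeral stand-in for the bastion host, plus a node
           # standing in for the production server
           - name: bridge.openstack.org
             label: ubuntu-focal
           - name: example01.opendev.org
             label: ubuntu-focal
       files:
         # only run when files relevant to this service change
         - playbooks/service-example.yaml
         - playbooks/roles/example/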
These jobs are crafted to replicate production as closely as possible.
Reading the job definitions in
:git_file:`zuul.d/system-config-run.yaml` will give you a feel for the
hosts that are set up with each job.  When you view the job results in
the Zuul UI, you will see many logs collected from a number of hosts
that simulate the production environment.  This has all the
information you generally need to debug problems, but the best place
to start is the *artifacts* tab, which has some curated links to
useful overviews.

One of the job artifacts is the `ARA report
<https://ara.recordsansible.org/>`__.  This is a graphical view of the
*nested* Ansible run on the (ephemeral) bastion host against the
(ephemeral) production-test nodes.  This is generally the first stop
for finding deployment issues.

Another artifact is the ``testinfra results``.  `Testinfra
<https://testinfra.readthedocs.io/en/latest/>`__ allows us to define
unit-test-like checks of functionality such as service and API status,
correct deployment of users and files, and other interesting details.
Failures here indicate that the deployment steps worked, but some part
of the operation of that system is not as we expect.  The
``testinfra`` code driving this is kept in :git_file:`testinfra`, and
test files are named for the service they test.

Finally there is a ``screenshots`` artifact, which is a link to a
directory that some tests populate with image files.  Tests that bring
up interactive services use a headless browser to take screenshots of
important pages to verify correct operation.

The logs tab has links to the raw logs; these collect much more
detail, such as ``syslog``, Apache logs, database dumps, etc.  Once
you have identified the general problem from the above steps, these
logs provide the in-depth details for further analysis.

Playbooks and roles
-------------------

The starting point for all services is generally the playbooks and
roles kept in :git_file:`playbooks/`.  Most playbooks are named
``service-<name>.yaml`` and their naming indicates which production
areas they drive.

During testing, these same playbooks are run against the test nodes.
Note that the testing hosts are given names that match the group
configuration in the jobs defined in
:git_file:`zuul.d/system-config-run.yaml`.

These playbooks are usually small; they call out to roles, where most
of the work is done.  Roles are kept in :git_file:`playbooks/roles/`.
These roles are written to be as generic as possible, but they are not
expected to be used outside the OpenDev production deployment system.

These playbooks and roles are the same for CI and deployment.

Hosts and variables
-------------------

The playbooks above run on groups of hosts which are defined in
:git_file:`inventory/service/groups.yaml`.

The production hosts are kept in an inventory at
:git_file:`inventory/base/hosts.yaml`.  In CI, the inventory is
generated by Zuul (as it allocates ephemeral nodes from the testing
pool).

Public production and testing variables are kept under
:git_file:`inventory/`.  The one difference between CI and production
is *secrets* such as API keys, tokens and passwords; in production the
*nested* Ansible populates these variables for the deployment directly
from values stored on the bastion host.  In CI, dummy values should be
populated via the templates under
:git_file:`playbooks/zuul/templates/`.

Production secrets are currently managed manually by OpenDev
administrators on the bastion host.
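Putting the pieces above together, a hypothetical ``example`` service
would pair a group definition with a playbook roughly as follows.
This is a sketch only -- the group pattern, host pattern and role name
are illustrative, and the real files carry more detail::

   # inventory/service/groups.yaml: map a group to host name patterns
   groups:
     example:
       - example[0-9]*.opendev.org

   # playbooks/service-example.yaml: apply the service roles to the group
   - hosts: example
     roles:
       - example

In CI, the test nodes are given names matching these patterns, so the
same playbook runs unchanged against the ephemeral test hosts.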
Deployment
----------

After review and approval of a change, Zuul will perform final gate
testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the
*check* pipeline, and approving a change triggers Zuul to run gate
tests and merge in the *gate* pipeline, the merge of a change triggers
Zuul to run the deployment jobs in the *deploy* pipeline.

These jobs are named ``infra-prod-<service>`` and run the same
playbooks and roles as in the CI system, except against the production
services.  Zuul will deploy the merged changes to the bastion host,
and then trigger the bastion host to run a *nested* Ansible deployment
against the production hosts.

Since the production run logs may leak sensitive information, they are
not published openly.  You can add a GPG public key to
:git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and
then ensure the ``infra-prod-<service>`` job has your name in its
``encrypt_logs_job_recipients`` variable.  Once this is approved and
committed, you will be able to view the encrypted production log
output via the Zuul build page for the production run.

Containers
----------

Most services are containerised.  When looking at the
``system-config-run-*`` and ``infra-prod-*`` jobs you may see
dependencies on container build/upload/promote jobs; this indicates
that we build a bespoke container for this environment.

The base ``Dockerfile`` for each of these containers is found under
:git_file:`docker/`.  Most are straightforward, but some of the more
complicated services have multiple steps and layers.  Any changes to a
``Dockerfile`` will be tested as usual, and when approved the
containers will be rebuilt, published and pulled onto the production
systems automatically.
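On the production hosts, the roles typically start these containers
from a small compose file templated into place.  A minimal sketch of
the shape -- the service name, image and paths here are entirely
hypothetical; consult the individual roles for the real files::

   # docker-compose.yaml as templated by a role (all values hypothetical)
   version: '2'
   services:
     example:
       image: docker.io/opendevorg/example:latest
       network_mode: host
       restart: always
       volumes:
         - /var/lib/example:/var/lib/example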
Certificates
------------

We provision SSL certificates from Let's Encrypt; see
:ref:`letsencrypt`.

DNS
---

DNS for ``opendev.org`` (and some other domains) is also handled
through the review system; see the `zone-opendev.org
<https://opendev.org/opendev/zone-opendev.org>`__ project.

Backups
-------

Any host in the ``backup`` group will have backups to two
geographically distinct locations set up by the deployment
infrastructure.  See the ``borg-backup`` role for details on including
or excluding various data.

Remote access
-------------

Hosts are only configured by Ansible, but they can be set up for
interactive access if required.

Add your public key to :git_file:`inventory/base/group_vars/all.yaml`
and include a stanza like this in your server ``host_vars``::

   extra_users:
     - your_user_name

See :ref:`ssh-access` for details on keys.

Documentation
-------------

Each service should have an RST file with documentation about the
server and services in :git_file:`doc/source/`.

Submitting Changes
------------------

If you are not familiar with submitting changes to Gerrit, you can
start with any of the various developer guides, such as::

   https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
   https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
   https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and the major source of
historical information.  It is expected that a developer can read the
description of a change and have enough context to generally
understand why it was introduced.  Comments in the code-review system
are useful for understanding the deeper history of each change, but
each change should stand alone once committed.  Only the most trivial
of changes that are completely self-evident (e.g. typo fixes) would be
expected to have less than a few sentences of context in their change
log.

Lifecycle
---------

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require
resources, you should do some preparation work.  Putting an item on
the `weekly team meeting agenda
<https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting>`__ is
always welcome.  Logs of previous meetings can be seen at
https://meetings.opendev.org/.  More complicated changes may justify
going through the spec process; see the `infra-specs repository
<https://opendev.org/opendev/infra-specs>`__.  If the existing admins
are aware of the details before reviews start appearing, it makes the
process much smoother.

All preliminary work can be done in an iterative fashion using the CI
jobs at your own pace.  The ``#opendev`` IRC channel on OFTC is a good
place to find help during this process.  Alternatively, questions are
welcome on the `service-discuss mailing list
<https://lists.opendev.org/>`__.

Your change (or changes) will be reviewed and may take a few rounds
before final approval (in Gerrit terms, a ``+2`` vote).  Most changes
will receive a few ``-1`` votes from reviewers during development.
This is really just a flag to note that some further discussion is
required; it is not a rejection.

You can set ``Workflow`` to ``-1`` in Gerrit on changes you are still
working on; some developers also put ``[WIP]`` at the front of the
change description to indicate to reviewers that they probably
shouldn't spend much time on it yet.  Small, stand-alone sequential
changes are encouraged, and Zuul makes testing such "stacks" of
changes trivial.

OpenDev admins currently deploy production virtual machines, the
storage attached to those machines, and secrets on the bastion host
manually.  This needs to happen before changes are put into
production.  Discussion with the admins will help decide on the cloud
provider, the VM size and storage, and other such matters.

Once resources are allocated and the new host is available in the
inventory, the production jobs can deploy.  After this, the service
moves into a maintenance phase; changes can be proposed and, after
review, deployed.
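For illustration, an inventory entry is a standard Ansible YAML
inventory stanza; a minimal, entirely hypothetical example would look
like::

   all:
     hosts:
       example01.opendev.org:
         ansible_host: 203.0.113.10

The real entries in :git_file:`inventory/base/hosts.yaml` carry
additional details for each host.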
diff --git a/doc/source/sysadmin.rst b/doc/source/sysadmin.rst
index 2f171572d1..8df4b5ee53 100644
--- a/doc/source/sysadmin.rst
+++ b/doc/source/sysadmin.rst
@@ -1,89 +1,15 @@
 :title: System Administration

+This page collects technical information of relevance to those
+interested in the administration of OpenDev services.  For a
+higher-level overview, see :ref:`opendev-infra-overview`.
+
 .. _sysadmin:

 System Administration
 #####################

-Our infrastructure is code and contributions to it are handled just
-like the rest of OpenDev.  This means that anyone can contribute to
-the installation and long-running maintenance of systems without shell
-access, and anyone who is interested can provide feedback and
-collaborate on code reviews.
-
-The configuration of every system operated by the infrastructure team
-is managed by Ansible and driven by continuous integration and
-deployment by Zuul.
-
-  https://opendev.org/opendev/system-config
-
-All system configuration should be encoded in that repository so that
-anyone may propose a change in the running configuration to Gerrit.
-
-Guide to CI and CD
-==================
-
-All development work is based around Zuul jobs and a continuous
-integration and development workflow.
-
-The starting point for all services is generally the playbooks and
-roles kept in :git_file:`playbooks`.
-Most playbooks are named ``service-<name>.yaml`` and will indicate
-which production areas they drive.
-
-These playbooks run on groups of hosts which are defined in
-:git_file:`inventory/service/groups.yaml`.  The production hosts are kept
-in an inventory at :git_file:`inventory/base/hosts.yaml`.  During
-testing, these same playbooks are run against the test nodes.  You can
-note that the testing hosts are given names that match the group
-configuration in the jobs defined in
-:git_file:`zuul.d/system-config-run.yaml`.
-
-Deployment is run through a bastion host ``bridge.openstack.org``.
-After changes are approved, Zuul will run Ansible on this host; which
-will then connect to the production hosts and run the orchestration
-using the latest committed code.  The bridge is a special host because
-it holds production secrets, such as passwords or API keys, and
-unredacted logs.  As many logs as possible are provided in the public
-Zuul job results, but they need to be audited to ensure they do not
-leak secrets and thus in some cases may not be published.
-
-For CI testing, each job creates a "fake" bridge, along with the
-servers required for orchestration.  Thus CI testing is performed by a
-"nested" Ansible -- Zuul initially connects to the testing bridge node
-and deploys it, and then this node runs its own Ansible that tests the
-orchestration to the other testing nodes, simulating the production
-environment.  This is driven by playbooks kept in
-:git_file:`playbooks/zuul`.  Here you will also find testing
-definitions of host variables that are kept secret for production
-hosts.
-
-After the test environment is orchestrated, the
-`testinfra <https://testinfra.readthedocs.io/en/latest/>`__ tests from
-:git_file:`testinfra` are run.  This validates the complete
-orchestration testing environment; things such as ensuring user
-creation, container readiness and service wellness checks are all
-performed.
-
-.. _adding_new_server:
-
-Adding a New Server
-===================
-
-Creating a new server for your service requires discussion with the
-OpenDev administrators to ensure donor resources are being used
-effectively.
-
-* Hosts should only be configured by Ansible.  Nonetheless, in some
-  cases SSH access can be granted.  Add your public key to
-  :git_file:`inventory/base/group_vars/all.yaml` and include a stanza
-  like this in your server ``host_vars``::
-
-    extra_users:
-      - your_user_name
-
-* Add an RST file with documentation about the server and services in
-  :git_file:`doc/source` and add it to the index in that directory.
+.. _ssh-access:

 SSH Access
 ==========