::

  Copyright (c) 2015 Hewlett-Packard Development Company, L.P.

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.

  http://creativecommons.org/licenses/by/3.0/legalcode

=======
Zuul v3
=======

Storyboard: https://storyboard.openstack.org/#!/story/2000305

As part of an effort to streamline Zuul and Nodepool into an
easier-to-use system that scales better and is more flexible, some
significant changes are proposed to both. The overall goals are:

* Make Zuul scale to thousands of projects.
* Make Zuul more multi-tenant friendly.
* Make it easier to express complex scenarios in the layout.
* Make Nodepool more useful for non-virtual nodes.
* Make Nodepool more efficient for multi-node tests.
* Remove the need for long-running slaves.
* Make it easier to use Zuul for continuous deployment.
* Support private installations using external test resources.
* Keep Zuul simple.

Problem Description
===================

Nodepool
--------

Currently Nodepool is designed to supply single-use nodes to jobs. We
have extended it to support supplying multiple nodes to a single job
for multi-node tests; however, the implementation of that is very
inefficient and will not scale with heavy use. The current system
uses a special multi-node label to indicate that a job requires the
number of nodes provided by that label. This means that pairs (or
triplets, or larger sets) of servers need to be created together,
which may cause delays while specific servers are created, and servers
may sit idle because they are destined only for use in multi-node jobs
and cannot be used for jobs which require fewer (or more) nodes.

Nodepool also currently has no ability to supply an inventory of nodes
which are not created and destroyed. It would be nice to allow
Nodepool to mediate access to real hardware, for instance.

Zuul
----

Zuul is currently fundamentally a single-tenant application. Some
folks want to use it in a multi-tenant environment. Even within
OpenStack, we have a use for multi-tenancy: OpenStack might be one
tenant, and each stackforge project might be another. Even without
the OpenStack/stackforge divide, we may still want the kind of
separation multi-tenancy can provide. Multi-tenancy should allow
multiple tenants to have the same job and project names, but also to
share configuration if desired. Tenants should be able to define
their own pipelines, and optionally control some or all of their own
configuration.

OpenStack's Zuul configuration currently uses Jenkins and Jenkins Job
Builder (JJB) to define jobs. We use very few features of Jenkins,
and Zuul was designed to facilitate our move away from it. The JJB
job definitions are complex, and some of that complexity comes from
specific Jenkins behaviors that we currently need to support.
Additionally, there is no support for orchestrating actions across
multiple test nodes.

Proposed Change
===============

Nodepool
--------

Nodepool should be made to support explicit node requests and
releases. That is to say, it should act more like its name -- a node
pool. It should support the existing model of single-use nodes as
well as long-term nodes that need mediated access.

Nodepool should implement the following gearman function to get one or
more nodes::

  request-nodes:
    input: {node-types: [list of node types],
            request: a unique id for the request,
            requestor: unique gearman worker id (e.g., zuul)}

When multiple nodes are requested together, nodepool will return nodes
within the same AZ of the same provider.
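
For illustration, a request for a pair of trusty nodes might look
like this (the identifier values are hypothetical)::

  request-nodes:
    input: {node-types: [trusty, trusty],
            request: 'a8f3c2',
            requestor: 'zuul'}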

Requests for nodes will go into a FIFO queue and be satisfied in the
order received according to node availability. This should make
demand and allocation calculations much simpler.

A node type is simply a string, such as 'trusty', that corresponds to
an entry in the nodepool config file.

The requestor is used to identify the system that is requesting the
node. To handle the case where the requesting system (e.g., Zuul)
exits abruptly and fails to return a node to the pool, Nodepool will
reverse the direction of gearman function invocation when supplying a
set of nodes. When completing allocation of a node, nodepool invokes
the following gearman function::

  accept-nodes:<requestor>:
    input: {nodes: [list of node records],
            request: the unique id from the request}
    output: {used: boolean}

If `zuul` was the requestor supplied with request-nodes, then the
actual function invoked would be `accept-nodes:zuul`.

A node record is a dictionary with the following fields: id,
public-ipv4, public-ipv6, private-ipv4, private-ipv6, hostname,
node-type. The list should be in the same order as the types
specified in the request.
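
For illustration, a single node record might look like this (all
values are hypothetical)::

  {id: 1500492,
   public-ipv4: 203.0.113.10,
   public-ipv6: '2001:db8::10',
   private-ipv4: 10.0.0.10,
   private-ipv6: null,
   hostname: controller.example.org,
   node-type: trusty}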

When the accept-nodes job is complete, it will return a WORK_COMPLETE
packet with `used` set to true if any nodes were used. `used` will be
set to false if all nodes were unused (for instance, if Zuul no longer
needs the requested nodes). In this case, the nodes may be reassigned
to another request. If a WORK_FAIL packet is received, including due
to disconnection, the nodes will be treated as used.

Nodepool will then decide whether the nodes should be returned to the
pool, rebuilt, or deleted according to the type of node and current
demand.

This model is much more efficient for multi-node tests, where we will
no longer need special multi-node labels. Instead, the multi-node
configuration can be much more ad hoc and vary per job.

Nodepool should also allow the specification of a static inventory of
non-dynamic nodes. These may be nodes that are running on real
hardware, for instance.
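
A hypothetical sketch of how such an inventory might be expressed in
the nodepool config file (this spec does not define the exact syntax;
the key names below are assumptions)::

  ### nodepool.yaml (hypothetical static inventory syntax)
  static-nodes:
    - name: bare-metal-01
      host: 192.0.2.5
      node-type: hardware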

Zuul
----

Tenants
~~~~~~~

Zuul's main configuration should define tenants, and tenants should
specify config files to include. These include files should define
pipelines, jobs, and projects, all of which are namespaced to the
tenant (so different tenants may have different jobs with the same
names)::

  ### main.yaml
  - tenant:
      name: openstack
      include:
        - global_config.yaml
        - openstack.yaml

Files may be included by more than one tenant, so common items can be
placed in a common file and referenced globally. This means that for
OpenStack, for example, we can define pipelines and our base job
definitions (with logging info, etc.) once and include them in all of
our tenants::

  ### main.yaml (continued)
  - tenant:
      name: openstack-infra
      include:
        - global_config.yaml
        - infra.yaml

A tenant may optionally specify repos from which it may derive its
configuration. In this manner, a project may keep its Zuul
configuration within its own repo. This would only happen if the main
configuration file specified that it is permitted::

  ### main.yaml (continued)
  - tenant:
      name: random-stackforge-project
      include:
        - global_config.yaml
      source:
        my-gerrit:
          repos:
            - stackforge/random  # Specific project config is in-repo

Nodesets
~~~~~~~~

A significant focus of Zuul v3 is a close interaction with Nodepool,
both to make running multi-node jobs simpler and to facilitate running
jobs on static resources. To that end, the node configuration for a
job is introduced as a first-class resource. This allows both simple
and complex node configurations to be independently defined and then
referenced by name in jobs::

  ### global_config.yaml
  - nodeset:
      name: precise
      nodes:
        - name: controller
          image: ubuntu-precise
  - nodeset:
      name: trusty
      nodes:
        - name: controller
          image: ubuntu-trusty
  - nodeset:
      name: multinode
      nodes:
        - name: controller
          image: ubuntu-xenial
        - name: compute
          image: ubuntu-xenial

Jobs may either specify their own node configuration in-line, or refer
to a previously defined nodeset by name.
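
For example (an illustrative sketch, using the job syntax described in
the next section), a job could request the `multinode` set defined
above by name::

  - job:
      name: example-multinode-job
      nodes: multinode  # Refers to the nodeset defined above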

Jobs
~~~~

Jobs defined in-repo may not have access to the full feature set
(including some authorization features). They also may not override
existing jobs.

Job definitions continue to have the features in the current Zuul
layout, but they also take on some of the responsibilities currently
handled by the Jenkins (or other worker) definition::

  ### global_config.yaml
  # Every tenant in the system has access to these jobs (because their
  # tenant definition includes it).
  - job:
      name: base
      timeout: 30m
      nodes: precise
      auth:  # Auth may only be defined in central config, not in-repo
        inherit: true  # Child jobs may inherit these credentials
        swift:
          - container: logs
      workspace: /opt/workspace  # Where to place git repositories
      post-run:
        - archive-logs

Jobs have inheritance, and the above definition provides a base level
of functionality for all jobs. It sets a default timeout, requests a
single node (of type precise), and requests swift credentials to
upload logs. For security, job credentials are not available to be
inherited unless the 'inherit' flag is set to true. For example, a
job to publish a release may need credentials to upload to a
distribution site -- users should not be able to subclass that job and
use its credentials for another purpose.

Further jobs may extend and override the remaining parameters::

  ### global_config.yaml (continued)
  # The python 2.7 unit test job
  - job:
      name: python27
      parent: base
      nodes: trusty

Our use of job names specific to projects is a holdover from when we
wanted long-lived slaves on Jenkins to efficiently re-use workspaces.
This hasn't been necessary for a while, though we have used this to
our advantage when collecting stats and reports. However, job
configuration can be simplified greatly if we simply have a job that
runs the python 2.7 unit tests and can be used for any project. To
the degree that we want to know how often this job failed on nova, we
can add that information back in when reporting statistics. Jobs may
have multiple aspects to accommodate differences among branches,
etc.::

  ### global_config.yaml (continued)
  # Version that is run for changes on stable/diablo
  - job:
      name: python27
      parent: base
      branches: stable/diablo
      nodes:
        - name: controller
          image: ubuntu-lucid

  # Version that is run for changes on stable/juno
  - job:
      name: python27
      parent: base
      branches: stable/juno  # Could be combined into previous with regex
      nodes: precise         # if concept of "best match" is defined

Jobs may specify that they require more than one node::

  ### global_config.yaml (continued)
  - job:
      name: devstack-multinode
      parent: base
      nodes: multinode

Jobs defined centrally (i.e., not in-repo) may specify auth info::

  ### global_config.yaml (continued)
  - job:
      name: pypi-upload
      parent: base
      auth:
        password:
          pypi-password: pypi-password
          # This looks up 'pypi-password' from an encrypted yaml file
          # and adds it into variables for the job

Note that this job may not be inherited from because of its auth
information (it does not set the 'inherit' flag).

Projects
~~~~~~~~

Pipeline definitions are similar to the current syntax, except that
they support specifying additional information for jobs in the context
of a given project and pipeline. For instance, rather than specifying
that a job is globally non-voting, you may specify that it is
non-voting for a given project in a given pipeline::

  ### openstack.yaml
  - project:
      name: openstack/nova
      gate:
        queue: integrated  # Shared queues are manually built
        jobs:
          - python27  # Runs version of job appropriate to branch
          - pep8:
              nodes: trusty  # Override the node type for this project
          - devstack
          - devstack-deprecated-feature:
              branches: stable/juno  # Only run on stable/juno changes
              voting: false  # Non-voting
      post:
        jobs:
          - tarball:
              jobs:
                - pypi-upload

Templates are still supported. If a project lists a job that is
defined in a template that is also applied to that project, the
project-local specification of the job will modify that supplied by
the template, as sketched below.
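
A hypothetical illustration (this spec does not define the exact
template syntax; the `project-template` and `templates` names below
are assumptions)::

  ### openstack.yaml (continued; hypothetical template syntax)
  - project-template:
      name: python-jobs
      gate:
        jobs:
          - python27

  - project:
      name: openstack/example
      templates:
        - python-jobs
      gate:
        jobs:
          - python27:
              voting: false  # Modifies the template's python27 entry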

Currently, unique job names are used to build shared change queues.
Since job names will no longer be unique, shared queues must be
manually constructed by assigning them a name. Projects with the same
queue name for the same pipeline will have a shared queue.

A subset of functionality is available to projects that are permitted
to use in-repo configuration::

  ### stackforge/random/.zuul.yaml
  - job:
      name: random-job
      parent: base    # From global config; gets us logs
      nodes: precise

  - project:
      name: stackforge/random
      gate:
        jobs:
          - python27   # From global config
          - random-job # From local config

Ansible
~~~~~~~

The actual execution of jobs will continue to be distributed to
workers over Gearman, so the actual implementation of how jobs are
executed will remain pluggable; however, the zuul-gearman protocol
will need to change. Because the system needs to perform coordinated
tasks on one or more remote systems, the initial implementation of the
workers will use Ansible, which is particularly suited to that job.

The executable content of jobs should be defined as ansible playbooks.
Playbooks can be fairly simple and might consist of little more than
"run this shell script" for those who are not otherwise interested in
ansible::

  ### stackforge/random/playbooks/random-job.yaml
  ---
  - hosts: controller
    tasks:
      - shell: run_some_tests.sh

Global jobs may define ansible roles for common functions::

  ### openstack-infra/zuul-playbooks/python27.yaml
  ---
  - hosts: controller
    roles:
      - role: tox
        env: py27

Because ansible has well-articulated multi-node orchestration
features, this permits very expressive job definitions for multi-node
tests. A playbook can specify different roles to apply to the
different nodes that the job requested::

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
  ---
  - hosts: controller
    roles:
      - devstack
  - hosts: compute
    roles:
      - devstack-compute

Additionally, if a project is already defining ansible roles for its
deployment, then those roles may be easily applied in testing, making
CI even closer to CD.

The pre- and post-run entries in the job definition might also refer
to ansible playbooks and can be used to simplify job setup and
cleanup::

  ### openstack-infra/zuul-playbooks/archive-logs.yaml
  ---
  - hosts: all
    roles:
      - role: archive-logs
        archive_dir: /opt/workspace/logs  # Parameter name illustrative

Execution
~~~~~~~~~

A new Zuul component would be created to execute jobs. Rather than
running a worker process on each node (which requires installing
software on the test node, establishing and maintaining network
connectivity back to Zuul, and coordinating actions across nodes for
multi-node tests), this new component will accept jobs from Zuul and,
for each one, write an ansible inventory file with the node and
variable information, and then execute the ansible playbook for that
job. This means that the new Zuul component will maintain ssh
connections to all hosts currently running a job. This could become a
bottleneck, but ansible and ssh have been known to scale to a large
number of simultaneous hosts, and this component may be scaled
horizontally. It should be simple enough that it could even be
automatically scaled if needed. In exchange, this makes node
configuration simpler (test nodes need only have an ssh public key
installed) and makes tests behave more like deployment.
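
For illustration, the inventory written for the devstack-multinode job
above might look like this (a sketch; the addresses are hypothetical)::

  # Example inventory generated by the launcher
  controller ansible_host=203.0.113.10
  compute ansible_host=203.0.113.11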

To support the use case where the Zuul control plane should not be
accessible by the workers (for instance, because the control plane is
on a private network while the workers are in a public cloud), the
direction of transfer of changes under test to the workers will be
reversed.

Instead of workers fetching from zuul-mergers, the new zuul-launcher
will take on the task of calculating merges as well as running
ansible.

Continuous Deployment
~~~~~~~~~~~~~~~~~~~~~

Special consideration is needed in order to use Zuul to drive
continuous deployment of development or production systems. Rather
than specifying that Zuul should obtain a node from nodepool in order
to run a job, it may be configured to simply execute an ansible task
on a specified host::

  - job:
      name: run-puppet-apply
      parent: base
      host: review.openstack.org
      fingerprint: 4a:28:cb:03:6a:d6:79:0b:cc:dc:60:ae:6a:62:cf:5b

Because any configuration of the host and credential information is
potentially accessible to anyone able to read the Zuul configuration
(which is everyone, in the case of OpenStack's configuration) and
could therefore be copied into their own section of Zuul's
configuration, users must add one of two public keys to the server in
order for the job to function. Zuul will generate an SSH keypair for
every tenant as well as for every project. If a user trusts anyone
able to make configuration changes to their tenant, then they may use
Zuul's public key for their tenant. If they are only able to trust
their own project configuration in Zuul, they may add Zuul's public
key for that specific project. Zuul will make all public keys
available at known HTTP addresses so that users may retrieve them.
When executing such a job, Zuul will try the project and tenant SSH
keys in order.

Tenant Isolation
~~~~~~~~~~~~~~~~

In order to prevent users of one Zuul tenant from accessing the git
repositories of other tenants, Zuul will no longer consider the git
repositories it manages to be public. This could be solved by passing
credentials to the workers for them to use when fetching changes;
however, an additional consideration is the desire to have workers
fully network-isolated from the Zuul control plane.

As described above, the new zuul-launcher, rather than the workers,
will calculate merges and run ansible. The launcher will then be
responsible for placing prepared versions of requested repositories
onto the worker.

Status reporting will also be tenant-isolated; however, without
HTTP-level access controls, additional measures may be needed to
prevent tenants from accessing the status of other tenants.
Eventually, Zuul may support an authenticated REST API that will solve
this problem natively.

Alternatives
------------

Continuing with the status quo is an alternative, as is continuing the
process of switching to Turbo Hipster to replace Jenkins. However,
these address only some of the goals stated at the top.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* corvus

Also:

* jhesketh
* mordred

Gerrit Branch
-------------

Nodepool and Zuul will both be branched for development related to
this spec. The "master" branches will continue to receive patches
related to maintaining the current versions, and the "feature/zuulv3"
branches will receive patches related to this spec. The .gitreview
files will be updated to submit to the correct branches by default.

Work Items
----------

* Modify nodepool to support the new allocation and distribution (mordred)
* Modify zuul to support the new syntax and isolation (corvus)
* Create the zuul launcher (jhesketh)
* Prepare basic infra ansible roles
* Translate the OpenStack JJB config to ansible

Repositories
------------

We may create new repositories for ansible roles, or they may live in
project-config.

Servers
-------

We may create more combined zuul-launcher/mergers.

DNS Entries
-----------

No changes other than those needed for additional servers.

Documentation
-------------

This will require changes to Nodepool and Zuul's documentation, as
well as to infra-manual.

Security
--------

No substantial changes to security around the Zuul server; the use of
Zuul private keys for access to remote hosts by Zuul has security
implications but will not be immediately used by OpenStack
Infrastructure.

Testing
-------

Existing Nodepool and Zuul tests will need to be adapted. The
configuration will be different, but much of the functionality should
remain the same, so many functional tests should have direct
equivalents.

Dependencies
============

None.