Merge "Add Zuul v3 spec."
commit f61f151111
@@ -54,6 +54,7 @@ permits.
    specs/storyboard_task_branches
    specs/trystack-site
    specs/zuul_split
+   specs/zuulv3

 Implemented Design Specifications
 =================================
specs/zuulv3.rst (new file, 557 lines)
@@ -0,0 +1,557 @@
::

  Copyright (c) 2015 Hewlett-Packard Development Company, L.P.

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

=======
Zuul v3
=======

Storyboard: https://storyboard.openstack.org/#!/story/2000305

As part of an effort to streamline Zuul and Nodepool into an
easier-to-use system that scales better and is more flexible, some
significant changes are proposed to both. The overall goals are:

* Make Zuul scale to thousands of projects.
* Make Zuul more multi-tenant friendly.
* Make it easier to express complex scenarios in the layout.
* Make Nodepool more useful for non-virtual nodes.
* Make Nodepool more efficient for multi-node tests.
* Remove the need for long-running slaves.
* Make it easier to use Zuul for continuous deployment.
* Support private installations using external test resources.
* Keep Zuul simple.

Problem Description
===================

Nodepool
--------

Currently, Nodepool is designed to supply single-use nodes to jobs.
We have extended it to support supplying multiple nodes to a single
job for multi-node tests; however, the implementation of that is very
inefficient and will not scale under heavy use. The current system
uses a special multi-node label to indicate that a job requires the
number of nodes provided by that label. This means that pairs (or
triplets, or larger sets) of servers need to be created together,
which may cause delays while specific servers are created, and
servers may sit idle because they are destined only for use in
multi-node jobs and cannot be used for jobs which require fewer (or
more) nodes.

Nodepool also currently has no ability to supply an inventory of
nodes which are not created and destroyed. It would be nice to allow
Nodepool to mediate access to real hardware, for instance.

Zuul
----

Zuul is currently fundamentally a single-tenant application. Some
folks want to use it in a multi-tenant environment. Even within
OpenStack, we have a use for multi-tenancy: OpenStack might be one
tenant, and each stackforge project might be another. Even without
the OpenStack/stackforge divide, we may still want the kind of
separation multi-tenancy can provide. Multi-tenancy should allow for
multiple tenants to have the same job and project names, but also to
share configuration if desired. Tenants should be able to define
their own pipelines and optionally control some or all of their own
configuration.

OpenStack's Zuul configuration currently uses Jenkins and Jenkins Job
Builder (JJB) to define jobs. We use very few features of Jenkins,
and Zuul was designed to facilitate our move away from Jenkins. The
JJB job definitions are complex, and some of the complexity comes
from specific Jenkins behaviors that we currently need to support.
Additionally, there is no support for orchestrating actions across
multiple test nodes.

Proposed Change
===============

Nodepool
--------

Nodepool should be made to support explicit node requests and
releases. That is to say, it should act more like its name -- a node
pool. It should support the existing model of single-use nodes as
well as long-term nodes that need mediated access.

Nodepool should implement the following Gearman function to get one
or more nodes::

  request-nodes:
    input: {node-types: [list of node types],
            request: a unique id for the request,
            requestor: unique gearman worker id (eg zuul)}

When multiple nodes are requested together, Nodepool will return
nodes within the same AZ of the same provider.

Requests for nodes will go into a FIFO queue and be satisfied in the
order received, according to node availability. This should make
demand and allocation calculations much simpler.

A node type is simply a string, such as 'trusty', that corresponds to
an entry in the Nodepool config file.

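For illustration, here is a minimal sketch of how a requestor such as
Zuul might call this function using the Python `gear` library; the
server address and payload values are assumptions for the example::

  import json
  import uuid

  import gear  # OpenStack's pure-Python Gearman library

  # Connect to the Gearman server; the address is an assumption.
  client = gear.Client('zuul')
  client.addServer('gearman.example.org')
  client.waitForServer()

  # Ask for a controller/compute pair in a single request.
  payload = {'node-types': ['trusty', 'trusty'],
             'request': str(uuid.uuid4()),
             'requestor': 'zuul'}
  job = gear.Job(b'request-nodes', json.dumps(payload).encode('utf8'))
  client.submitJob(job)
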
The requestor is used to identify the system that is requesting the
node. To handle the case where the requesting system (e.g., Zuul)
exits abruptly and fails to return a node to the pool, Nodepool will
reverse the direction of Gearman function invocation when supplying a
set of nodes. When completing the allocation of a node, Nodepool
invokes the following Gearman function::

  accept-nodes:<requestor>:
    input: {nodes: [list of node records],
            request: the unique id from the request}
    output: {used: boolean}

If `zuul` was the requestor supplied with request-nodes, then the
actual function invoked would be `accept-nodes:zuul`.

A node record is a dictionary with the following keys: id,
public-ipv4, public-ipv6, private-ipv4, private-ipv6, hostname, and
node-type. The list should be in the same order as the types
specified in the request.

When the job is complete, the requestor will return a WORK_COMPLETE
packet with `used` set to true if any nodes were used. `used` will be
set to false if all nodes were unused (for instance, if Zuul no
longer needs the requested nodes). In this case, the nodes may be
reassigned to another request. If a WORK_FAIL packet is received,
including due to disconnection, the nodes will be treated as used.

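The requestor's side of this exchange might look like the following
sketch, again using the Python `gear` library (the server address and
handling shown are illustrative only)::

  import json

  import gear

  # Zuul registers the reversed function under its requestor name.
  worker = gear.Worker('zuul')
  worker.addServer('gearman.example.org')  # assumed server address
  worker.waitForServer()
  worker.registerFunction('accept-nodes:zuul')

  # Block until Nodepool offers a set of nodes.
  job = worker.getJob()
  args = json.loads(job.arguments.decode('utf8'))
  for node in args['nodes']:
      # Each record carries id, public-ipv4, public-ipv6,
      # private-ipv4, private-ipv6, hostname, and node-type.
      print(node['hostname'], node['node-type'])

  # Report whether the nodes were actually used; used=false lets
  # Nodepool reassign them to another request.
  job.sendWorkComplete(json.dumps({'used': True}).encode('utf8'))
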
Nodepool will then decide whether the nodes should be returned to the
pool, rebuilt, or deleted, according to the type of node and current
demand.

This model is much more efficient for multi-node tests, where we will
no longer need special multi-node labels. Instead, the multi-node
configuration can be much more ad hoc and vary per job.

Nodepool should also allow the specification of a static inventory of
non-dynamic nodes. These may be nodes that are running on real
hardware, for instance.

Zuul
----

Tenants
~~~~~~~

Zuul's main configuration should define tenants, and tenants should
specify config files to include. These included files should define
pipelines, jobs, and projects, all of which are namespaced to the
tenant (so different tenants may have different jobs with the same
names)::

  ### main.yaml
  - tenant:
      name: openstack
      include:
        - global_config.yaml
        - openstack.yaml

Files may be included by more than one tenant, so common items can be
placed in a common file and referenced globally. This means that, for
example, OpenStack can define its pipelines and base job definitions
(with logging info, etc.) once, and include them in all of its
tenants::

  ### main.yaml (continued)
  - tenant:
      name: openstack-infra
      include:
        - global_config.yaml
        - infra.yaml

A tenant may optionally specify repos from which it may derive its
configuration. In this manner, a repo may keep its Zuul configuration
within itself. This happens only if the main configuration file
permits it::

  ### main.yaml (continued)
  - tenant:
      name: random-stackforge-project
      include:
        - global_config.yaml
      source:
        my-gerrit:
          repos:
            - stackforge/random  # Specific project config is in-repo

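To make the tenant namespacing concrete, the following Python sketch
(with an invented helper structure; not the planned implementation)
shows how a loader might key jobs by tenant so that identically named
jobs in different tenants do not collide::

  import yaml

  def load_tenants(main_config_path):
      """Build a {(tenant, job name): job} registry from main.yaml."""
      registry = {}
      with open(main_config_path) as f:
          entries = yaml.safe_load(f)
      for entry in entries:
          tenant = entry.get('tenant')
          if not tenant:
              continue
          for include in tenant.get('include', []):
              with open(include) as f:
                  for item in yaml.safe_load(f) or []:
                      job = item.get('job')
                      if job:
                          # A file included by several tenants yields
                          # a separately namespaced entry for each.
                          registry[(tenant['name'], job['name'])] = job
      return registry
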
Jobs
~~~~

Jobs defined in-repo may not have access to the full feature set
(including some authorization features). They also may not override
existing jobs.

Job definitions continue to have the features of the current Zuul
layout, but they also take on some of the responsibilities currently
handled by the Jenkins (or other worker) definition::

  ### global_config.yaml
  # Every tenant in the system has access to these jobs (because their
  # tenant definition includes it).
  - job:
      name: base
      timeout: 30m
      node: precise  # Just a variable for later use
      nodes:  # The operative list of nodes
        - name: controller
          image: {node}  # Substitute the variable
      auth:  # Auth may only be defined in central config, not in-repo
        inherit: true  # Child jobs may inherit these credentials
        swift:
          - container: logs
      workspace: /opt/workspace  # Where to place git repositories
      post-run:
        - archive-logs

Jobs have inheritance, and the above definition provides a base level
of functionality for all jobs. It sets a default timeout, requests a
single node (of type precise), and requests swift credentials to
upload logs. For security, job credentials are not available to be
inherited unless the 'inherit' flag is set to true. For example, a
job to publish a release may need credentials to upload to a
distribution site -- users should not be able to subclass that job
and use its credentials for another purpose.

Further jobs may extend and override the remaining parameters::

  ### global_config.yaml (continued)
  # The python 2.7 unit test job
  - job:
      name: python27
      parent: base
      node: trusty

Our use of job names specific to projects is a holdover from when we
wanted long-lived slaves on Jenkins to efficiently re-use workspaces.
This hasn't been necessary for a while, though we have used it to our
advantage when collecting stats and reports. However, job
configuration can be simplified greatly if we simply have one job
that runs the python 2.7 unit tests and can be used for any project.
To the degree that we want to know how often this job failed on nova,
we can add that information back in when reporting statistics. Jobs
may have multiple aspects to accommodate differences among branches,
etc.::

  ### global_config.yaml (continued)
  # Version that is run for changes on stable/icehouse
  - job:
      name: python27
      parent: base
      branch: stable/icehouse
      node: precise

  # Version that is run for changes on stable/juno
  - job:
      name: python27
      parent: base
      branch: stable/juno  # Could be combined into previous with regex
      node: precise        # if concept of "best match" is defined

Jobs may specify that they require more than one node::

  ### global_config.yaml (continued)
  - job:
      name: devstack-multinode
      parent: base
      node: trusty  # Could do same branch mapping as above
      nodes:
        - name: controller
          image: {node}
        - name: compute
          image: {node}

Jobs defined centrally (i.e., not in-repo) may specify auth info::

  ### global_config.yaml (continued)
  - job:
      name: pypi-upload
      parent: base
      auth:
        password:
          pypi-password: pypi-password
          # This looks up 'pypi-password' from an encrypted yaml file
          # and adds it into variables for the job

Note that this job may not be inherited from, because of the auth
information.

Projects
~~~~~~~~

Pipeline definitions are similar to the current syntax, except that
they support specifying additional information for jobs in the
context of a given project and pipeline. For instance, rather than
specifying that a job is globally non-voting, you may specify that it
is non-voting for a given project in a given pipeline::

  ### openstack.yaml
  - project:
      name: openstack/nova
      gate:
        queue: integrated  # Shared queues are manually built
        jobs:
          - python27  # Runs version of job appropriate to branch
          - pep8:
              node: trusty  # Override the node type for this project
          - devstack
          - devstack-deprecated-feature:
              branch: stable/juno  # Only run on stable/juno changes
              voting: false  # Non-voting
      post:
        jobs:
          - tarball:
              jobs:
                - pypi-upload

Templates are still supported. If a project lists a job that is
defined in a template that is also applied to that project, the
project-local specification of the job modifies the one supplied by
the template.

Currently, unique job names are used to build shared change queues.
Since job names will no longer be unique, shared queues must be
manually constructed by assigning them a name. Projects with the same
queue name for the same pipeline will have a shared queue.

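As an illustration, shared queues might be assembled with logic along
these lines (a Python sketch over dicts shaped like the YAML above;
not the actual implementation)::

  from collections import defaultdict

  def build_shared_queues(projects):
      """Group project names into queues by (pipeline, queue name)."""
      queues = defaultdict(list)
      for project in projects:
          for pipeline, config in project.items():
              if not isinstance(config, dict):
                  continue  # skip the 'name' entry
              queue_name = config.get('queue')
              if queue_name:
                  queues[(pipeline, queue_name)].append(project['name'])
      return queues
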
A subset of functionality is available to projects that are permitted
to use in-repo configuration::

  ### stackforge/random/.zuul.yaml
  - job:
      name: random-job
      parent: base  # From global config; gets us logs
      node: precise

  - project:
      name: stackforge/random
      gate:
        jobs:
          - python27    # From global config
          - random-job  # From local config

Ansible
~~~~~~~

The actual execution of jobs will continue to be distributed to
workers over Gearman, so the implementation of how jobs are executed
will remain pluggable; however, the Zuul-Gearman protocol will need
to change. Because the system needs to perform coordinated tasks on
one or more remote systems, the initial implementation of the workers
will use Ansible, which is particularly suited to that job.

The executable content of jobs should be defined as Ansible
playbooks. Playbooks can be fairly simple, and might consist of
little more than "run this shell script" for those who are not
otherwise interested in Ansible::

  ### stackforge/random/playbooks/random-job.yaml
  ---
  - hosts: controller
    tasks:
      - shell: run_some_tests.sh

Global jobs may define Ansible roles for common functions::

  ### openstack-infra/zuul-playbooks/python27.yaml
  ---
  - hosts: controller
    roles:
      - role: tox
        env: py27

Because Ansible has well-articulated multi-node orchestration
features, this permits very expressive job definitions for multi-node
tests. A playbook can specify different roles to apply to the
different nodes that the job requested::

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
  ---
  - hosts: controller
    roles:
      - devstack
  - hosts: compute
    roles:
      - devstack-compute

Additionally, if a project is already defining Ansible roles for its
deployment, then those roles may easily be applied in testing, making
CI even closer to CD.

The pre- and post-run entries in the job definition might also refer
to Ansible playbooks, and can be used to simplify job setup and
cleanup::

  ### openstack-infra/zuul-playbooks/archive-logs.yaml
  ---
  - hosts: all
    roles:
      # 'logs' is a hypothetical role parameter naming the directory
      # to archive.
      - role: archive-logs
        logs: "/opt/workspace/logs"

Execution
~~~~~~~~~

A new Zuul component would be created to execute jobs. Rather than
running a worker process on each node (which requires installing
software on the test node, establishing and maintaining network
connectivity back to Zuul, and the ability to coordinate actions
across nodes for multi-node tests), this new component will accept
jobs from Zuul, and for each one, write an Ansible inventory file
with the node and variable information, and then execute the Ansible
playbook for that job. This means that the new Zuul component will
maintain ssh connections to all hosts currently running a job. This
could become a bottleneck, but Ansible and ssh have been known to
scale to a large number of simultaneous hosts, and this component may
be scaled horizontally. It should be simple enough that it could even
be scaled automatically if needed. In turn, this makes node
configuration simpler (test nodes need only have an ssh public key
installed) and makes tests behave more like deployment.

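A sketch of the per-job execution step follows (in Python; the
inventory layout and variable handling are assumptions rather than a
settled format)::

  import subprocess
  import tempfile

  def run_job(nodes, playbook, variables):
      """Write an inventory for the job's nodes and run its playbook."""
      with tempfile.NamedTemporaryFile(
              mode='w', suffix='.inventory', delete=False) as inv:
          for node in nodes:
              # One line per host: the job-facing name and the
              # address Ansible should ssh to.
              inv.write('%s ansible_ssh_host=%s\n' % (
                  node['name'], node['public-ipv4']))
          inventory = inv.name

      extra_vars = ' '.join('%s=%s' % kv for kv in variables.items())
      # Execute the job's playbook against that inventory.
      return subprocess.call(
          ['ansible-playbook', '-i', inventory,
           '--extra-vars', extra_vars, playbook])
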
To support the use case where the Zuul control plane should not be
accessible by the workers (for instance, because the control plane is
on a private network while the workers are in a public cloud), the
direction of transfer of changes under test to the workers will be
reversed.

Instead of workers fetching from zuul-mergers, the new zuul-launcher
will take on the task of calculating merges as well as running
Ansible.

Continuous Deployment
~~~~~~~~~~~~~~~~~~~~~

Special consideration is needed in order to use Zuul to drive
continuous deployment of development or production systems. Rather
than specifying that Zuul should obtain a node from Nodepool in order
to run a job, it may be configured to simply execute an Ansible task
on a specified host::

  - job:
      name: run-puppet-apply
      parent: base
      host: review.openstack.org
      fingerprint: 4a:28:cb:03:6a:d6:79:0b:cc:dc:60:ae:6a:62:cf:5b

Because any configuration of host and credential information is
potentially accessible to anyone able to read the Zuul configuration
(which, for OpenStack's configuration, is everyone) and could
therefore be copied into another section of Zuul's configuration,
users must add one of two public keys to the server in order for the
job to function. Zuul will generate an SSH keypair for every tenant
as well as for every project. If a user trusts everyone able to make
configuration changes to their tenant, then they may use Zuul's
public key for that tenant. If they are only able to trust their own
project configuration in Zuul, they may add Zuul's public key for
that specific project. Zuul will make all public keys available at
known HTTP addresses so that users may retrieve them. When executing
such a job, Zuul will try the project and tenant SSH keys in order.

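A hedged sketch of the key-selection behavior (using the Python
`paramiko` library; the key-file paths and login user are
assumptions)::

  import paramiko

  def connect_with_job_keys(host, project_key, tenant_key, user='zuul'):
      """Try the per-project key first, then the per-tenant key."""
      client = paramiko.SSHClient()
      # In practice the host key would be checked against the
      # fingerprint in the job definition; AutoAddPolicy stands in
      # for that verification here.
      client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
      for key_file in (project_key, tenant_key):
          try:
              client.connect(host, username=user, key_filename=key_file)
              return client
          except paramiko.AuthenticationException:
              continue
      raise RuntimeError('neither project nor tenant key was accepted')
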
Tenant Isolation
~~~~~~~~~~~~~~~~

In order to prevent users of one Zuul tenant from accessing the git
repositories of other tenants, Zuul will no longer consider the git
repositories it manages to be public. This could be solved by passing
credentials to the workers for them to use when fetching changes;
however, an additional consideration is the desire to have workers
fully network-isolated from the Zuul control plane.

As noted above, instead of workers fetching from zuul-mergers, the
new zuul-launcher will take on the task of calculating merges as well
as running Ansible. The launcher will then be responsible for placing
prepared versions of the requested repositories onto the worker.

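One way the launcher might place a prepared repository onto a worker
(a sketch only; the actual transport is not specified here)::

  import subprocess

  def push_repo_to_worker(local_repo, worker_host, remote_path):
      """Push a prepared (merged) repository to the worker over ssh."""
      # Create a bare repository on the worker to receive the push.
      subprocess.check_call(
          ['ssh', worker_host, 'git', 'init', '--bare', remote_path])
      # Push all prepared branches from the launcher's local copy.
      subprocess.check_call(
          ['git', 'push', '--all',
           '%s:%s' % (worker_host, remote_path)],
          cwd=local_repo)
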
Status reporting will also be tenant-isolated; however, without
HTTP-level access controls, additional measures may be needed to
prevent tenants from accessing the status of other tenants.
Eventually, Zuul may support an authenticated REST API that will
solve this problem natively.

Alternatives
------------

Continuing with the status quo is an alternative, as is continuing
the process of switching to Turbo Hipster to replace Jenkins.
However, this addresses only some of the goals stated at the top.

Implementation
==============

Assignee(s)
-----------

Primary assignee:

* corvus

Also:

* jhesketh

Gerrit Topic
------------

Use Gerrit topic "zuulv3" for all patches related to this spec.

.. code-block:: bash

  git-review -t zuulv3

Work Items
----------

* Modify Nodepool to support the new allocation and distribution model
* Modify Zuul to support the new syntax and tenant isolation
* Create the Zuul launcher
* Prepare basic infra Ansible roles
* Translate the OpenStack JJB config to Ansible

Repositories
------------

We may create new repositories for Ansible roles, or they may live in
project-config.

Servers
-------

We may create more combined zuul-launcher/merger servers.

DNS Entries
-----------

No changes, other than those needed for additional servers.

Documentation
-------------

This will require changes to Nodepool's and Zuul's documentation, as
well as to infra-manual.

Security
--------

There are no substantial changes to security around the Zuul server.
The use of Zuul private keys for access to remote hosts has security
implications, but this will not be immediately used by OpenStack
Infrastructure.

Testing
-------

Existing Nodepool and Zuul tests will need to be adapted. The
configuration will be different, but much of the functionality should
be the same, so many functional tests should have direct equivalents.

Dependencies
============

None.