e589f986d8
We moved publishing of this repo to https://docs.opendev.org/opendev/infra-specs/latest/, update all links. Also, remove unused import of autodoc from doc/source/conf.py Change-Id: I156b12d0e49362563cd8c39f30475bafe889c23b
461 lines
18 KiB
ReStructuredText
461 lines
18 KiB
ReStructuredText
.. code-block:: text
|
|
|
|
Copyright 2019 Red Hat, Inc.
|
|
|
|
This work is licensed under a Creative Commons Attribution 3.0
|
|
Unported License.
|
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
|
|
|
..
|
|
This template should be in ReSTructured text. Please do not delete
|
|
any of the sections in this template. If you have nothing to say
|
|
for a whole section, just write: "None". For help with syntax, see
|
|
http://sphinx-doc.org/rest.html To test out your formatting, see
|
|
http://www.tele3.cz/jbar/rest/rest.html
|
|
|
|
===================================
|
|
Use letsencrypt for infra SSL needs
|
|
===================================
|
|
|
|
Incorporate https://letsencrypt.org/ into our configuration
|
|
management.
|
|
|
|
Problem Description
|
|
===================
|
|
|
|
Currently SSL certificates for infrastructure services are manually
|
|
purchased at some cost, renewed by hand and provisioned into
|
|
configuration management. The process is costly, work-intensive and
|
|
does not completely match our usual infrastructure-as-code workflows.
|
|
The overheads are a barrier to getting encryption rolled out across
|
|
all services.
|
|
|
|
We would like to replace manual provision of purchased certificates
|
|
with those provided by https://letsencrypt.org/. The solution should
|
|
make certificate provision and renewal completely automated and
|
|
integrate with the existing configuration management environment.
|
|
|
|
What is the existing situation?
|
|
-------------------------------
|
|
|
|
* Certificates manually renewed.
|
|
|
|
* Certificate key data is manually inserted into Ansible variables
|
|
on bridge.o.o in the key storage area
|
|
|
|
* Key material is then deployed as part of usual periodic
|
|
ansible/puppet run on host; usually by puppet as a hiera variable
|
|
(e.g. ``ssl_cert => hiera('foo_ssl_cert')``) but possibly by ansible
|
|
directly.
|
|
|
|
* Self-signed certificates are created for a number of less
|
|
"important" services, such as dev servers, or nodepool status pages.
|
|
Primarily due to the cost an inconvenience of creating certificates.
|
|
|
|
What can we use letsencrypt for?
|
|
---------------------------------
|
|
|
|
Domain validation certificates. This means you have proved you
|
|
control the domain to letsencrypt. They do not offer higher level
|
|
verification such as `Extended Validation
|
|
<https://en.wikipedia.org/wiki/Extended_Validation_Certificate>`__
|
|
which prove the identity of the owner.
|
|
|
|
Our needs are currently exclusively domain validation certificates.
|
|
|
|
What about non-http services?
|
|
-----------------------------
|
|
|
|
Although not a initial target, non-https services like zookeeper and
|
|
gearman should be possible.
|
|
|
|
Zookeeper for example uses `JKS
|
|
<https://cwiki-test.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide>`__
|
|
which it should be possible to `import
|
|
<https://community.letsencrypt.org/t/tutorial-java-keystores-jks-with-lets-encrypt/34754>`__
|
|
|
|
SMTP can also integrate with letsencrypt for TLS; see
|
|
https://starttls-everywhere.org/ and
|
|
https://www.jwz.org/blog/2018/06/starttls-everywhere/
|
|
|
|
Is letsencrypt a long term solution?
|
|
------------------------------------
|
|
|
|
The `Internet Security Research Group
|
|
<https://letsencrypt.org/isrg/>`__ and by extension the letsencrypt
|
|
project appears to have gained a critical mass of support, making it
|
|
the defacto solution for domain validation certificates. We can
|
|
benefit and contribute to open tooling around letsencrypt.
|
|
|
|
How do you validate you own a domain for renewal?
|
|
-------------------------------------------------
|
|
|
|
* https://letsencrypt.org/how-it-works/
|
|
|
|
You run an agent that communicates to letsencrypt, which challenges it
|
|
to either
|
|
|
|
* put a http resource on the domain
|
|
* add a dns entry
|
|
|
|
Once done, it will accept the public key associated with the challenge
|
|
and issue certificates.
|
|
|
|
Are there rate limit concerns?
|
|
------------------------------
|
|
|
|
* https://letsencrypt.org/docs/rate-limits/
|
|
|
|
50 new certificates per registered domain per week. It is possible to
|
|
have up to 100 names per certificate. Requests that count as renewals
|
|
count towards quota, but are also allowed if you are over quota.
|
|
|
|
The last renewal was 17 certificates, with one having 8 names and the
|
|
rest only 1 name (per clarkb). Thus we are well within rate limits.
|
|
|
|
Any interactions with other TLS infrastructure?
|
|
------------------------------------------------
|
|
|
|
There already exists a ``*.openstack.org`` certificate in use for the
|
|
caching provider for ``www.openstack.org`` and is thus outside infra
|
|
control. There is also a certificate for
|
|
``openstackid-resources.openstack.org`` which was not infra issued.
|
|
|
|
Can we use the HTTP challenge model?
|
|
------------------------------------
|
|
|
|
This model has a chicken-egg problem of needing a webserver to respond
|
|
before being able to get a certificate for the webserver you are
|
|
trying to secure. There are several facets to this.
|
|
|
|
It would be possible for any host to configure it's Apache/webserver
|
|
to statically serve the ACME challenge, and have an agent run
|
|
periodically on the host to generate the keys. Some common roles may
|
|
be useful in the base job repos, but essentially each service would
|
|
set this up in a besopke manner for it's environment.
|
|
|
|
This probably works fine for completely new services. However, in a
|
|
server-replacement scenario, we can not transparently generate new
|
|
keys for the server before switching the A/AAAA DNS entries. For
|
|
example, if the ``foo.opendev.org`` service is handled by
|
|
``foo01.opendev.org`` and we wish to make ``foo02.opendev.org``
|
|
replace it, we can not generate a certificate on ``foo02.opendev.org``
|
|
that covers ``foo.opendev.org,foo2.opendev.org`` before
|
|
``foo.opendev.org`` is redirected. Thus we either need longer
|
|
downtime, or to start copying keys, etc. This *can* be done with DNS
|
|
validation, because all we need to do is prove that we can update
|
|
``foo.opendev.org``; not actually switch it's A/AAAA records.
|
|
|
|
Incorporating HTTP challenge response into our existing deployment is
|
|
problematic. Updating Apache configuration, installing ACME agents
|
|
and other setup crosses into a lot of existing puppet which would need
|
|
to be heavily reworked.
|
|
|
|
Other approaches can could be considered. Certificates could be
|
|
initially issued centrally and the remote HTTP server would need to be
|
|
either given the token respond to the challenge, or otherwise somehow
|
|
forward the challenge back to the issuing host. This is possible with
|
|
redirects and such; but again requires much more engineering than a
|
|
DNS based approach for no benefits. It helps the extant puppet
|
|
situation of distributing keys from central storage, but makes things
|
|
more difficult for new deployments where we wish to keep keys on host
|
|
only.
|
|
|
|
We would also loose some flexibility; port 80/443 doesn't *have* to be
|
|
bound to a "standard" webserver, and thus it may require bespoke
|
|
solutions to ensure the HTTP challenge URL can be responded to. With
|
|
DNS validation we never have to worry about making the host respond
|
|
correctly.
|
|
|
|
The conclusion of this spec is that HTTP authentication is not
|
|
suitable for our requirements.
|
|
|
|
DNS Validation Overview
|
|
-----------------------
|
|
|
|
DNS based validation seems a more reasonable path forward. We have
|
|
two DNS environments to consider.
|
|
|
|
Firstly, the extant ``openstack.org`` domain is hosted by Rackspace's
|
|
Cloud DNS offering. We do not expect to start new services homed
|
|
within the ``openstack.org`` domain, but we would ideally like to
|
|
support existing services having valid letsencrypt certificates.
|
|
|
|
Secondly, infra runs its own authoritative nameservers as part of the
|
|
OpenDev initiative. This is the primary focus.
|
|
|
|
Are certificates precious?
|
|
--------------------------
|
|
|
|
Moving forward, we do not want to continue distributing certificates
|
|
from centralised secrets; they should be considered ephemeral and
|
|
generated and kept on the host as much as possible. i.e. certificates
|
|
should not be precious.
|
|
|
|
However, there will be a period of transition as service deployment
|
|
moves from extant puppet modules which are taking their key data from
|
|
centralised secrets.
|
|
|
|
Ideally we can handle both.
|
|
|
|
What about load balancing?
|
|
--------------------------
|
|
|
|
With DNS based validation, every host can have a certificate that is
|
|
valid for itself and the balanced domain. For example,
|
|
``git01.openstack.org`` will request a certificate to cover
|
|
``git.openstack.org,git01.openstack.org``; upon request the two
|
|
relevant TXT authentication records (one for each domain) are placed
|
|
into DNS and the authentication is complete.
|
|
|
|
What ACME agent/tool should we use?
|
|
------------------------------------
|
|
|
|
The underlying protocol for getting certificates is called `ACME
|
|
<https://letsencrypt.org/how-it-works/>`__. There are many clients;
|
|
see `<https://letsencrypt.org/docs/client-options/>`__.
|
|
|
|
Many of these tools are quite extensive, as they perform a range of
|
|
operations like automatically deploying and updating
|
|
apache/ngnix/other configuration. As we manage configuration via our
|
|
own config management, we do not need any of these features.
|
|
|
|
The `acme.sh <https://github.com/Neilpang/acme.sh>` agent stands out
|
|
as being particularly suitable.
|
|
|
|
* It is implemented in ``sh`` (specifically *not* ``bash``) with very
|
|
standard UNIX tools (curl, openssl, etc). In contrast, tools like
|
|
`certbot <https://github.com/certbot/certbot>`__ need a python
|
|
environment. i.e. it basically runs anywhere, including tiny
|
|
containers.
|
|
* It is currently maintained and an active project, with CI
|
|
* We have some experience with large shell-based projects (devstack,
|
|
d-g). The code looks good.
|
|
* It supports authentication via the two methods of interest to us; a
|
|
"manual" mode where putting the TXT records is left up to the user,
|
|
and it can operate directly on Rackspace DNS API.
|
|
* For future flexibility can operate on many other hosted DNS
|
|
environments (like 50+ different ones).
|
|
* While it does have certificate deployment options to update
|
|
apache/ngnix/exim/sshd/and so on, they are all opt-in and not
|
|
necessary.
|
|
* It has an "I (heart) Docker" sticker on the page
|
|
|
|
For these reasons, ``acme.sh`` is used below.
|
|
|
|
Proposed Change
|
|
===============
|
|
|
|
The proposal is in two parts: the first covers the legacy
|
|
``openstack.org`` environment, whilst the second uses the self-hosted
|
|
nameservers behind ``opendev.org``.
|
|
|
|
openstack.org
|
|
-------------
|
|
|
|
Note: this proposal is prototyped at https://review.openstack.org/637456
|
|
|
|
An ansible-playbook can be written to iterate each ``openstack.org``
|
|
host of interest and generate certificate material with ``acme.sh``
|
|
using Rackspace DNS for letsencrypt authentication. This playbook
|
|
runs periodically on ``bridge.o.o`` (once a day is sufficient) and
|
|
will store any generated certificate material in a private certifcate
|
|
storage area.
|
|
|
|
Minor post-processing after issuing a certificate combines the
|
|
resulting file contents of the key material into a single YAML file
|
|
for each host, with pre-defined and unchanging variable names.
|
|
|
|
As part of the base run, this generated YAML file with certificate
|
|
data is copied into the relevant host's hiera directory (and hiera is
|
|
configured to load from this file). At this point, extant puppet has
|
|
access to the certificates as normal hiera variables and can deploy
|
|
them with very little change.
|
|
|
|
OpenDev
|
|
-------
|
|
|
|
Note: this proposal is prototyped at https://review.openstack.org/636759
|
|
|
|
For projects under the OpenDev umbrella we have two goals:
|
|
|
|
* Key material ephemeral and stored on server
|
|
* Using infra hosted DNS servers
|
|
|
|
We create an ansible role that will generate the certifiate data into
|
|
files on a given host. We will consider actually *using* the
|
|
certificate data as implementation specific; playbooks for ensuring
|
|
service restart, mounting into containers, etc will need to be
|
|
incorporated on top of this proposal.
|
|
|
|
The proposal is as follows:
|
|
|
|
* We utilise a separate signing domain, hereon referred to as
|
|
``acme-opendev.org`` (see note below). This is preconfigured, and
|
|
only used for ACME authentication.
|
|
* Hosts that want a certificate preconfigure a CNAME record
|
|
``_acme-challenge.hostname.opendev.org`` pointing to
|
|
``_acme-challenge.acme-opendev.org``. This is committed via the
|
|
normal review process to the main DNS configuration.
|
|
* Each host defines the certificates it wants and the domains that
|
|
should be covered by that certificate in a known configuration
|
|
variable.
|
|
* Provisioning happens via ansible roles running on ``bridge.o.o``
|
|
against the remote host that wants certificates as part of the
|
|
standard ``base.yaml`` playbook. The roles:
|
|
|
|
0. Ensure a valid install of ``acme.sh`` on the remote host
|
|
1. For each certificate required on the host, run ``acme.sh`` in
|
|
"DNS manual mode" on the remote host with the required domains to
|
|
be authenticated.
|
|
2. This generates an authentication token(s) output that needs to be
|
|
put into a TXT record on the signing domain. This is stored in a
|
|
pre-known host variable for the next step.
|
|
3. Another role then reads the authentication tokens from each host
|
|
via the variables, which are collated and put into a TXT records
|
|
created on ``adns1.opendev.org`` as
|
|
``_acme-challenge.acme-opendev.org``. Any necessary reloads
|
|
etc. are done to make the records visible (see note below on
|
|
lifespan).
|
|
4. When records are available, the acme.sh renewal process is run on
|
|
the remote host. The remote letsencrypt servers authenticate the
|
|
TXT records, and certificate material (cert, private key, chain
|
|
file) is generated and put in a well-known location on the host.
|
|
|
|
The host now has valid keys and further roles/playbooks can configure
|
|
services to use them.
|
|
|
|
Notes:
|
|
|
|
* we *could* have the signing domain as a sub-domain of
|
|
``opendev.org``. We choose a separate domain for several reasons.
|
|
Firstly, a large benefit of the signing domain is that it is
|
|
completely separate to your production domain; it is one less thing
|
|
to potentially mess up. Secondly, having a separate domain gives
|
|
us excellent forward flexibility -- for example; in the future we
|
|
could host this domain somewhere that has an API for updating TXT
|
|
records (designate instance, say) and use that directly for ACME
|
|
authentication. This would be transparent to ``opendev.org`` or
|
|
any other hosts.
|
|
* letsencyrpt `walks all the TXT records
|
|
<https://github.com/letsencrypt/boulder/blob/1fe8aa8128078f7d822b3e3f3bcc979003896cc0/va/va.go#L759:L767>`__,
|
|
and stops when it finds a match. So we don't have any problem with
|
|
multiple concurrent updates.
|
|
* The TXT records are not explicitly removed. They remain until the
|
|
next time a certificate is issued or renewed, when the entire
|
|
signing zone file is overwritten. They are not in any way
|
|
sensitive data. Their useful life, however, is only within the
|
|
single ansible/puppet pulse they were created in.
|
|
* The exact renewal process will be defined when we can issue
|
|
certificates. The plan is that if the host has all valid
|
|
certificates, step 1 above will return an empty list an nothing
|
|
further will be done. If any certificate has less than 30 days
|
|
remaining, acme.sh should automatically renew it, which is
|
|
essentially the same as issuing a new certificate.
|
|
* For organisations, letsencrypt has group accounts and such. These
|
|
will be implemented once basic issuance and renewal is working.
|
|
|
|
Alternatives
|
|
------------
|
|
|
|
* Continue with manual renewal
|
|
* We could only implement the ``opendev.org`` portion, leaving the RAX
|
|
DNS based authentication for ``openstack.org`` alone, assuming it
|
|
will not be required after the renewal period of the existing
|
|
certificates.
|
|
|
|
Implementation
|
|
==============
|
|
|
|
Assignee(s)
|
|
-----------
|
|
|
|
Who is leading the writing of the code? Or is this a blueprint where you're
|
|
throwing it out there to see who picks it up?
|
|
|
|
If more than one person is working on the implementation, please designate the
|
|
primary author and contact.
|
|
|
|
Primary assignee:
|
|
ianw
|
|
|
|
Can optionally list additional ids if they intend on doing substantial
|
|
implementation work on this blueprint.
|
|
|
|
Gerrit Topic
|
|
------------
|
|
|
|
Use Gerrit topic "letsencrypt-infra" for all patches related to this spec.
|
|
|
|
.. code-block:: bash
|
|
|
|
git-review -t letsencrypt-infra
|
|
|
|
Work Items
|
|
----------
|
|
|
|
* The broad strokes of the proposals are layed out in the prototypes.
|
|
Getting those various roles production ready, tested and documented
|
|
are the major work items.
|
|
|
|
|
|
Repositories
|
|
------------
|
|
|
|
Unlikely to require new repositories. Jobs will be in existing repos.
|
|
|
|
Servers
|
|
-------
|
|
|
|
No new servers required.
|
|
|
|
DNS Entries
|
|
-----------
|
|
|
|
``acme-opendev.org`` or whatever separate domain is used for CNAME
|
|
redirection will need to be provisioned.
|
|
|
|
|
|
Documentation
|
|
-------------
|
|
|
|
Standard infra documentation (system-config, job documentation, etc)
|
|
will need updates, but no external documentation should require
|
|
updates.
|
|
|
|
Security
|
|
--------
|
|
|
|
We are dealing with more private halves of keys. However we are
|
|
moving to a model where they are considered ephemeral and can be
|
|
rotated at any time.
|
|
|
|
|
|
Testing
|
|
-------
|
|
|
|
* We can stage implementation of keys to hosts, meaning no "flag" day
|
|
of switching everything, and we can test in isolation.
|
|
|
|
* We do not have a good way to update DNS entries for gate CI testing.
|
|
Initially we can generate self-signed certificates. For future
|
|
work, there are docker environments that run a LE environment. We
|
|
could stand-up one of these either on a per-test basis or as a
|
|
service (might be useful if we have a set CI signging key we could
|
|
configure hosts under test to trust).
|
|
|
|
* Initial production testing can be done with the letsencrypt staging
|
|
environment, which doesn't give valid keys but also doesn't have
|
|
restrictive limits if things are failing.
|
|
|
|
* Ongoing validation with certcheck
|
|
|
|
|
|
Dependencies
|
|
============
|
|
|
|
This work needs to integrate with ongoing work in the `update config
|
|
management spec`_.
|
|
|
|
.. _`update config management spec`: https://docs.opendev.org/opendev/infra-specs/latest/specs/update-config-management.html
|