We moved publishing of this repo to https://docs.opendev.org/opendev/infra-specs/latest/, update all links. Also, remove unused import of autodoc from doc/source/conf.py Change-Id: I156b12d0e49362563cd8c39f30475bafe889c23b
18 KiB
Copyright 2019 Red Hat, Inc.
This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
Use letsencrypt for infra SSL needs
Incorporate https://letsencrypt.org/ into our configuration management.
Problem Description
Currently SSL certificates for infrastructure services are manually purchased at some cost, renewed by hand and provisioned into configuration management. The process is costly, work-intensive and does not completely match our usual infrastructure-as-code workflows. The overheads are a barrier to getting encryption rolled out across all services.
We would like to replace manual provision of purchased certificates with those provided by https://letsencrypt.org/. The solution should make certificate provision and renewal completely automated and integrate with the existing configuration management environment.
What is the existing situation?
- Certificates manually renewed.
- Certificate key data is manually inserted into Ansible variables on bridge.o.o in the key storage area
- Key material is then deployed as part of usual periodic
ansible/puppet run on host; usually by puppet as a hiera variable (e.g.
ssl_cert => hiera('foo_ssl_cert')
) but possibly by ansible directly. - Self-signed certificates are created for a number of less "important" services, such as dev servers, or nodepool status pages. Primarily due to the cost an inconvenience of creating certificates.
What can we use letsencrypt for?
Domain validation certificates. This means you have proved you control the domain to letsencrypt. They do not offer higher level verification such as Extended Validation which prove the identity of the owner.
Our needs are currently exclusively domain validation certificates.
What about non-http services?
Although not a initial target, non-https services like zookeeper and gearman should be possible.
Zookeeper for example uses JKS which it should be possible to import
SMTP can also integrate with letsencrypt for TLS; see https://starttls-everywhere.org/ and https://www.jwz.org/blog/2018/06/starttls-everywhere/
Is letsencrypt a long term solution?
The Internet Security Research Group and by extension the letsencrypt project appears to have gained a critical mass of support, making it the defacto solution for domain validation certificates. We can benefit and contribute to open tooling around letsencrypt.
How do you validate you own a domain for renewal?
You run an agent that communicates to letsencrypt, which challenges it to either
- put a http resource on the domain
- add a dns entry
Once done, it will accept the public key associated with the challenge and issue certificates.
Are there rate limit concerns?
50 new certificates per registered domain per week. It is possible to have up to 100 names per certificate. Requests that count as renewals count towards quota, but are also allowed if you are over quota.
The last renewal was 17 certificates, with one having 8 names and the rest only 1 name (per clarkb). Thus we are well within rate limits.
Any interactions with other TLS infrastructure?
There already exists a *.openstack.org
certificate in
use for the caching provider for www.openstack.org
and is
thus outside infra control. There is also a certificate for
openstackid-resources.openstack.org
which was not infra
issued.
Can we use the HTTP challenge model?
This model has a chicken-egg problem of needing a webserver to respond before being able to get a certificate for the webserver you are trying to secure. There are several facets to this.
It would be possible for any host to configure it's Apache/webserver to statically serve the ACME challenge, and have an agent run periodically on the host to generate the keys. Some common roles may be useful in the base job repos, but essentially each service would set this up in a besopke manner for it's environment.
This probably works fine for completely new services. However, in a
server-replacement scenario, we can not transparently generate new keys
for the server before switching the A/AAAA DNS entries. For example, if
the foo.opendev.org
service is handled by
foo01.opendev.org
and we wish to make
foo02.opendev.org
replace it, we can not generate a
certificate on foo02.opendev.org
that covers
foo.opendev.org,foo2.opendev.org
before
foo.opendev.org
is redirected. Thus we either need longer
downtime, or to start copying keys, etc. This can be done with
DNS validation, because all we need to do is prove that we can update
foo.opendev.org
; not actually switch it's A/AAAA
records.
Incorporating HTTP challenge response into our existing deployment is problematic. Updating Apache configuration, installing ACME agents and other setup crosses into a lot of existing puppet which would need to be heavily reworked.
Other approaches can could be considered. Certificates could be initially issued centrally and the remote HTTP server would need to be either given the token respond to the challenge, or otherwise somehow forward the challenge back to the issuing host. This is possible with redirects and such; but again requires much more engineering than a DNS based approach for no benefits. It helps the extant puppet situation of distributing keys from central storage, but makes things more difficult for new deployments where we wish to keep keys on host only.
We would also loose some flexibility; port 80/443 doesn't have to be bound to a "standard" webserver, and thus it may require bespoke solutions to ensure the HTTP challenge URL can be responded to. With DNS validation we never have to worry about making the host respond correctly.
The conclusion of this spec is that HTTP authentication is not suitable for our requirements.
DNS Validation Overview
DNS based validation seems a more reasonable path forward. We have two DNS environments to consider.
Firstly, the extant openstack.org
domain is hosted by
Rackspace's Cloud DNS offering. We do not expect to start new services
homed within the openstack.org
domain, but we would ideally
like to support existing services having valid letsencrypt
certificates.
Secondly, infra runs its own authoritative nameservers as part of the OpenDev initiative. This is the primary focus.
Are certificates precious?
Moving forward, we do not want to continue distributing certificates from centralised secrets; they should be considered ephemeral and generated and kept on the host as much as possible. i.e. certificates should not be precious.
However, there will be a period of transition as service deployment moves from extant puppet modules which are taking their key data from centralised secrets.
Ideally we can handle both.
What about load balancing?
With DNS based validation, every host can have a certificate that is
valid for itself and the balanced domain. For example,
git01.openstack.org
will request a certificate to cover
git.openstack.org,git01.openstack.org
; upon request the two
relevant TXT authentication records (one for each domain) are placed
into DNS and the authentication is complete.
What ACME agent/tool should we use?
The underlying protocol for getting certificates is called ACME. There are many clients; see https://letsencrypt.org/docs/client-options/.
Many of these tools are quite extensive, as they perform a range of operations like automatically deploying and updating apache/ngnix/other configuration. As we manage configuration via our own config management, we do not need any of these features.
The acme.sh <https://github.com/Neilpang/acme.sh> agent stands out as being particularly suitable.
- It is implemented in
sh
(specifically notbash
) with very standard UNIX tools (curl, openssl, etc). In contrast, tools like certbot need a python environment. i.e. it basically runs anywhere, including tiny containers. - It is currently maintained and an active project, with CI
- We have some experience with large shell-based projects (devstack, d-g). The code looks good.
- It supports authentication via the two methods of interest to us; a "manual" mode where putting the TXT records is left up to the user, and it can operate directly on Rackspace DNS API.
- For future flexibility can operate on many other hosted DNS environments (like 50+ different ones).
- While it does have certificate deployment options to update apache/ngnix/exim/sshd/and so on, they are all opt-in and not necessary.
- It has an "I (heart) Docker" sticker on the page
For these reasons, acme.sh
is used below.
Proposed Change
The proposal is in two parts: the first covers the legacy
openstack.org
environment, whilst the second uses the
self-hosted nameservers behind opendev.org
.
openstack.org
Note: this proposal is prototyped at https://review.openstack.org/637456
An ansible-playbook can be written to iterate each
openstack.org
host of interest and generate certificate
material with acme.sh
using Rackspace DNS for letsencrypt
authentication. This playbook runs periodically on
bridge.o.o
(once a day is sufficient) and will store any
generated certificate material in a private certifcate storage area.
Minor post-processing after issuing a certificate combines the resulting file contents of the key material into a single YAML file for each host, with pre-defined and unchanging variable names.
As part of the base run, this generated YAML file with certificate data is copied into the relevant host's hiera directory (and hiera is configured to load from this file). At this point, extant puppet has access to the certificates as normal hiera variables and can deploy them with very little change.
OpenDev
Note: this proposal is prototyped at https://review.openstack.org/636759
For projects under the OpenDev umbrella we have two goals:
- Key material ephemeral and stored on server
- Using infra hosted DNS servers
We create an ansible role that will generate the certifiate data into files on a given host. We will consider actually using the certificate data as implementation specific; playbooks for ensuring service restart, mounting into containers, etc will need to be incorporated on top of this proposal.
The proposal is as follows:
- We utilise a separate signing domain, hereon referred to as
acme-opendev.org
(see note below). This is preconfigured, and only used for ACME authentication. - Hosts that want a certificate preconfigure a CNAME record
_acme-challenge.hostname.opendev.org
pointing to_acme-challenge.acme-opendev.org
. This is committed via the normal review process to the main DNS configuration. - Each host defines the certificates it wants and the domains that should be covered by that certificate in a known configuration variable.
- Provisioning happens via ansible roles running on
bridge.o.o
against the remote host that wants certificates as part of the standardbase.yaml
playbook. The roles:- Ensure a valid install of
acme.sh
on the remote host - For each certificate required on the host, run
acme.sh
in "DNS manual mode" on the remote host with the required domains to be authenticated. - This generates an authentication token(s) output that needs to be put into a TXT record on the signing domain. This is stored in a pre-known host variable for the next step.
- Another role then reads the authentication tokens from each host via
the variables, which are collated and put into a TXT records created on
adns1.opendev.org
as_acme-challenge.acme-opendev.org
. Any necessary reloads etc. are done to make the records visible (see note below on lifespan). - When records are available, the acme.sh renewal process is run on the remote host. The remote letsencrypt servers authenticate the TXT records, and certificate material (cert, private key, chain file) is generated and put in a well-known location on the host.
- Ensure a valid install of
The host now has valid keys and further roles/playbooks can configure services to use them.
Notes:
- we could have the signing domain as a sub-domain of
opendev.org
. We choose a separate domain for several reasons. Firstly, a large benefit of the signing domain is that it is completely separate to your production domain; it is one less thing to potentially mess up. Secondly, having a separate domain gives us excellent forward flexibility -- for example; in the future we could host this domain somewhere that has an API for updating TXT records (designate instance, say) and use that directly for ACME authentication. This would be transparent toopendev.org
or any other hosts.- letsencyrpt walks all the TXT records, and stops when it finds a match. So we don't have any problem with multiple concurrent updates.
- The TXT records are not explicitly removed. They remain until the next time a certificate is issued or renewed, when the entire signing zone file is overwritten. They are not in any way sensitive data. Their useful life, however, is only within the single ansible/puppet pulse they were created in.
- The exact renewal process will be defined when we can issue certificates. The plan is that if the host has all valid certificates, step 1 above will return an empty list an nothing further will be done. If any certificate has less than 30 days remaining, acme.sh should automatically renew it, which is essentially the same as issuing a new certificate.
- For organisations, letsencrypt has group accounts and such. These will be implemented once basic issuance and renewal is working.
Alternatives
- Continue with manual renewal
- We could only implement the
opendev.org
portion, leaving the RAX DNS based authentication foropenstack.org
alone, assuming it will not be required after the renewal period of the existing certificates.
Implementation
Assignee(s)
Who is leading the writing of the code? Or is this a blueprint where you're throwing it out there to see who picks it up?
If more than one person is working on the implementation, please designate the primary author and contact.
- Primary assignee:
-
ianw
Can optionally list additional ids if they intend on doing substantial implementation work on this blueprint.
Gerrit Topic
Use Gerrit topic "letsencrypt-infra" for all patches related to this spec.
git-review -t letsencrypt-infra
Work Items
- The broad strokes of the proposals are layed out in the prototypes. Getting those various roles production ready, tested and documented are the major work items.
Repositories
Unlikely to require new repositories. Jobs will be in existing repos.
Servers
No new servers required.
DNS Entries
acme-opendev.org
or whatever separate domain is used for
CNAME redirection will need to be provisioned.
Documentation
Standard infra documentation (system-config, job documentation, etc) will need updates, but no external documentation should require updates.
Security
We are dealing with more private halves of keys. However we are moving to a model where they are considered ephemeral and can be rotated at any time.
Testing
- We can stage implementation of keys to hosts, meaning no "flag" day of switching everything, and we can test in isolation.
- We do not have a good way to update DNS entries for gate CI testing. Initially we can generate self-signed certificates. For future work, there are docker environments that run a LE environment. We could stand-up one of these either on a per-test basis or as a service (might be useful if we have a set CI signging key we could configure hosts under test to trust).
- Initial production testing can be done with the letsencrypt staging environment, which doesn't give valid keys but also doesn't have restrictive limits if things are failing.
- Ongoing validation with certcheck
Dependencies
This work needs to integrate with ongoing work in the update config management spec.