:title: Infra Cloud

.. _infra_cloud:

Infra Cloud
###########

Introduction
============

With donated hardware and datacenter space, we can run an optimized
semi-private cloud for the purpose of adding testing capacity, and also
with an eye toward "dog fooding" OpenStack itself.

Current Status
==============

Currently this cloud is in the planning and design phases. This section
will be updated or removed as that changes.

Mission
=======

The infra-cloud's mission is to turn donated raw hardware resources into
expanded capacity for the OpenStack infrastructure nodepool.

Methodology
===========

Infra-cloud is run like any other infra managed service. Puppet modules
and Ansible do the bulk of configuring hosts, and Gerrit code review
drives 99% of activities, with logins used only for debugging and
repairing the service.

Requirements
============

* Compute - The intended workload is mostly nodepool-launched Jenkins
  slaves. Thus flavors that are capable of running these tests in a
  reasonable amount of time must be available (see the example flavor
  after this list). The flavor(s) must provide:

  * 8GB RAM

  * 8 * `vcpu`

  * 30GB root disk

* Images - Image upload must be allowed for nodepool.

* Uptime - Because there are other clouds that can keep some capacity
  running, 99.9% uptime should be acceptable.

* Performance - The performance of compute and networking in infra-cloud
  should be at least as good as, if not better than, the other nodepool
  clouds that infra uses today.

* Infra-core - Infra-core is in charge of running the service.
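
The compute requirements above map directly onto an ordinary Nova flavor.
A minimal sketch of creating such a flavor with the ``openstack`` client,
assuming admin credentials are loaded in the environment (the flavor name
``nodepool-8gb`` is only illustrative)::

  # RAM is specified in MiB, the root disk in GiB.
  openstack flavor create --ram 8192 --vcpus 8 --disk 30 --public nodepool-8gb
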
Implementation
==============

Multi-Site
----------

Despite the servers being in the same physical location and network,
they are divided into at least two logical "sites", vanilla and chocolate.
Each site will have its own cloud, and these clouds will share no data.

Vanilla
~~~~~~~

The vanilla cloud has 48 machines. Each machine has 96GB of RAM, 1.8TiB of
disk and 24 cores of Intel Xeon X5650 @ 2.67GHz processors.

Chocolate
~~~~~~~~~

The chocolate cloud has 100 machines. Each machine has 96GB of RAM, 1.8TiB of
disk and 32 cores of Intel Xeon E5-2670 0 @ 2.60GHz processors.

Software
--------

Infra-cloud runs the most recent OpenStack stable release. During the
period following a release, plans must be made to upgrade as soon as
possible. In the future the cloud may be continuously deployed.

Management
----------

* Currently a single "Ironic Controller" is installed by hand and used by both
  sites. That machine is enrolled into the puppet/ansible infrastructure and
  can be reached at baremetal00.vanilla.ic.openstack.org.

* The "Ironic Controller" will have bifrost installed on it. All of the
  other machines in that site will be enrolled in the Ironic that bifrost
  manages. Bifrost will be responsible for booting a base OS with an IP
  address and ssh key on each machine.

* You can interact with the Bifrost Ironic installation by sourcing
  ``/opt/stack/bifrost/env-vars`` and then running the ironic cli client (for
  example: ``ironic node-list``); see the sketch after this list.

* The machines will all be added to a manual ansible inventory file adjacent
  to the dynamic inventory that ansible currently uses to run puppet. Any
  metadata that the ansible infrastructure for running puppet needs that
  would have come from OpenStack infrastructure will simply be put into
  static ansible group_vars.

* The static inventory should be put into puppet so that it is public, with
  the IPMI passwords in hiera.

* An OpenStack cloud with KVM as the hypervisor will be installed using
  OpenStack puppet modules as per the normal infra installation of services.

* As with all OpenStack services, metrics will be collected in the public
  cacti and graphite services. The particular metrics are TBD.

* As a cloud has a large amount of pertinent log data, a public ELK cluster
  will be needed to capture and expose it.

* All Infra services run on the public internet, and the same will be true
  for the Infra Clouds and the Ironic Clouds. Insecure services that need
  to be accessible across machine boundaries will employ per-IP iptables
  rules rather than relying on a squishy middle.
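
Interacting with the bifrost-managed Ironic, as referenced in the list above,
is just a matter of loading its environment file first. A minimal sketch, run
on the Ironic controller (the node name passed to ``node-show`` is a
placeholder)::

  # Load the endpoint and credential variables written out by bifrost.
  source /opt/stack/bifrost/env-vars

  # List all enrolled machines, then inspect a single one.
  ironic node-list
  ironic node-show <node-uuid-or-name>
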
Architecture
------------

The generally accepted "Controller" and "Compute" layout is used,
with controllers running all non-compute services and compute nodes
running only nova-compute and supporting services.

* The cloud is deployed with two controllers in a DRBD storage pair
  with ACTIVE/PASSIVE configured and a VIP shared between the two.
  This is done to avoid complications with Galera and RabbitMQ at
  the cost of making failovers more painful and under-utilizing the
  passive stand-by controller.

* The cloud will use KVM because it is the default free hypervisor and
  has the widest user base in OpenStack.

* The cloud will use Neutron configured for Provider VLAN because we
  do not require tenant isolation and this simplifies our networking on
  compute nodes.

* The cloud will not use floating IPs because every node will need to be
  reachable via routable IPs and thus there is no need for separation. Also,
  Nodepool is under our control, so we don't have to worry about DNS TTLs
  or anything else causing a need for a particular endpoint to remain at
  a stable IP.

* The cloud will not use security groups because these are single-use VMs
  and they will configure any firewall inside the VM.

* The cloud will use MySQL because it is the default in OpenStack and has
  the widest user base.

* The cloud will use RabbitMQ because it is the default in OpenStack and
  has the widest user base. We don't have scaling demands that come close
  to pushing the limits of RabbitMQ.

* The cloud will run swift as a backend for glance so that we can scale
  image storage out as need arises.

* The cloud will run the keystone v3 and glance v2 APIs because these are
  the versions upstream recommends using.

* The cloud will run keystone on port 443.

* The cloud will not use the glance task API for image uploads; it will use
  the PUT interface because the task API does not function and we are not
  expecting a wide user base to be uploading many images simultaneously
  (see the example after this list).

* The cloud will provide DHCP directly to its nodes because we trust DHCP.

* The cloud will have config drive enabled because we believe it to be more
  robust than the EC2-style metadata service.

* The cloud will not have the meta-data service enabled because we do not
  believe it to be robust.
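
As an illustration of the PUT-style upload path mentioned above, an image can
be uploaded to glance with the ``openstack`` client; a minimal sketch, with
the file name and image name as placeholders::

  openstack image create --disk-format qcow2 --container-format bare \
      --file ./nodepool-image.qcow2 nodepool-image
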
Networking
----------

Neutron is used, with a single `provider VLAN`_ attached to VMs for the
simplest possible networking. DHCP is configured to hand each machine a
routable IP which can be reached directly from the internet to facilitate
nodepool/zuul communications.

.. _provider VLAN: http://docs.openstack.org/networking-guide/scenario-provider-lb.html

Each site will need two VLANs: one for the public IPs, to which every NIC of
every host will be attached, and which will get a publicly routable /23; and
a second VLAN that is connected only to the NIC of the Ironic Cloud and is
routed to the IPMI management network of all of the other nodes. Whether we
use LinuxBridge or Open vSwitch is still TBD.
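
For reference, creating a shared provider network and its subnet with the
neutron client would look roughly like the sketch below. The physical network
name, VLAN ID and address range are placeholders, not the actual values used
in either site::

  neutron net-create public --shared \
      --provider:network_type vlan \
      --provider:physical_network physnet1 \
      --provider:segmentation_id 100

  neutron subnet-create public 203.0.113.0/24 --name public-subnet \
      --gateway 203.0.113.1 \
      --allocation-pool start=203.0.113.10,end=203.0.113.250
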
SSL
---

Since we are the single user of Infracloud, we have configured the Vanilla and
Chocolate controllers to use the snakeoil ssl certs for each controller.
This gives us simple-to-generate certs with long lifetimes which we can trust
directly by asserting trust against the public cert.

If you need to update the certs in one of the clouds, simply run::

  /usr/sbin/make-ssl-cert generate-default-snakeoil --force-overwrite

on the controller in question. Then copy the contents of
``/etc/ssl/certs/ssl-cert-snakeoil.pem`` to public system-config hiera and
``/etc/ssl/private/ssl-cert-snakeoil.key`` to private hiera on the
puppetmaster.
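
Before copying the new cert into hiera it is worth a quick sanity check of
what was just generated; this is plain openssl, nothing infra-specific::

  openssl x509 -noout -subject -enddate \
      -in /etc/ssl/certs/ssl-cert-snakeoil.pem
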
Puppet will then ensure we trust the public cert everywhere that talks to the
controller (puppetmaster, nodepool, controller itself, compute nodes, etc.)
and deploy the private key so that it is used by services.
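
If you need to talk to one of the controllers by hand from a host where
puppet has not installed that trust, the same public cert can be handed to
the client directly. A sketch, assuming the pem has been copied locally and
that authentication settings are already present in the environment::

  export OS_CACERT=/path/to/ssl-cert-snakeoil.pem
  openstack catalog list
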
Troubleshooting
===============

Regenerating images
-------------------

When redeploying servers with bifrost, we may need to refresh the image
that is deployed to them, because we may need to add some packages, update
the elements that we use, consume the latest versions of projects, and so on.

To regenerate an image, follow these steps:

1. On the baremetal server, remove everything under the /httpboot directory.
   This will clean out the generated qcow2 image that is consumed by the
   servers.

2. If there is also a need to update the CoreOS image, remove everything
   under the /tftpboot directory. This will clean out the ramdisk image that
   is used when PXE booting.

3. Run the install playbook again so that it regenerates the image. Be sure
   to pass the skip_install flag, to avoid updating all of the
   bifrost-related projects (ironic, dib, etc.)::

     ansible-playbook -vvv -e @/etc/bifrost/bifrost_global_vars \
         -e skip_install=true \
         -i /opt/stack/bifrost/playbooks/inventory/bifrost_inventory.py \
         /opt/stack/bifrost/playbooks/install.yaml

4. After the install finishes, you can redeploy the servers again
   using the ``run_bifrost.sh`` script.
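
Steps 1 and 2 above boil down to a cleanup on the baremetal server before the
playbook is re-run; a condensed sketch, using the paths described in the steps
(only clear /tftpboot if the CoreOS ramdisk also needs refreshing)::

  # Step 1: drop the generated qcow2 deploy image.
  sudo rm -rf /httpboot/*

  # Step 2 (optional): drop the PXE ramdisk images as well.
  sudo rm -rf /tftpboot/*
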