:title: System Administration

This page collects technical information of relevance to those
interested in administering OpenDev services.  For a higher-level
overview, see :ref:`opendev-infra-overview`.

.. _sysadmin:

System Administration
#####################

.. _ssh-access:

SSH Access
==========

For any of the systems managed by the OpenDev Infrastructure team, the
following practices must be observed for SSH access:

* SSH access is only permitted with SSH public/private key
  authentication.
* Users must use a strong passphrase to protect their private key.  A
  passphrase of several words, at least one of which is not in a
  dictionary, is advised, or a random string of at least 16
  characters.
* To mitigate the inconvenience of using a long passphrase, users may
  want to use an SSH agent so that the passphrase is only requested
  once per desktop session.
* Users' private keys must never be stored anywhere except their own
  workstation(s).  In particular, they must never be stored on any
  remote server.
* If users need to 'hop' from a server or bastion host to another
  machine, they must not copy a private key to the intermediate
  machine (see above).  Instead, SSH agent forwarding may be used.
  However, because a compromised intermediate machine could ask the
  agent to sign requests without the user's knowledge, in this case
  only an SSH agent that interactively prompts the user each time a
  signing request is received (i.e., ssh-agent, but not gnome-keyring)
  should be used, and the SSH keys should be added with the
  confirmation constraint (``ssh-add -c``).
* The number of SSH keys that are configured to permit access to
  OpenDev machines should be kept to a minimum.
* OpenDev Infrastructure machines must use Ansible to centrally manage
  and configure user accounts, and the SSH authorized_keys files from
  the opendev/system-config repository.
* SSH keys should be periodically rotated (at least once per year).
  During rotation, a new key can be added alongside the old one for a
  time, and then the old one removed.

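As a rough sketch of the agent-forwarding practice above, a key can be
loaded with the confirmation constraint before hopping through an
intermediate host.  The key path, user name and bastion hostname are
placeholders rather than real OpenDev values, and ``ssh-add -c`` relies
on an askpass program being available to display the confirmation
prompt::

  # Start an agent for this shell session if one is not already running
  eval "$(ssh-agent -s)"

  # Add the key so that every signing request must be confirmed
  ssh-add -c ~/.ssh/id_ed25519

  # Connect to the intermediate host with agent forwarding enabled
  ssh -A myname@bastion.example.org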

Gerrit Admins
=============

To provide a reasonable firewall from outside authentication systems,
Gerrit administrators keep two accounts: one for normal code review
activity and one for performing Gerrit administration.  Following the same
pattern as our Kerberos administrator account logins, the admin account
corresponding to ``$USER`` would be ``$USER.admin`` (Gerrit doesn't allow
``/`` in usernames) so it can be easily identified when auditing
activity.  Unlike the normal code review account, the admin account should
have no OpenID, so that it is only accessible by API/CLI methods and
cannot be compromised at the third-party ID provider.

To create a personal Gerrit admin account from a shell on the server, run
the following command::

  sudo -u gerrit2 ssh -i ~gerrit2/review_site/etc/ssh_host_rsa_key \
    -p 29418 -l 'Gerrit Code Review' localhost \
    "suexec --as openstack-project-creator -- \
    gerrit create-account --group Administrators --full-name myname.admin \
    --ssh-key 'ssh-rsa AAAA...BCDE myname@computer' myname.admin"

We ``suexec`` as the ``openstack-project-creator`` account because the
magic ``Gerrit Code Review`` pseudoaccount can't set group memberships, so
we need to run that command as a user which is already in the
``Administrators`` group.  With an account like this, routine actions like
populating new groups with initial members are still quite simple::

  ssh -p 29418 myname.admin@review.opendev.org \
    "gerrit set-members some-new-group --add somebody@example.org"

Another common example is bypassing Zuul to submit a change for merging
directly to a project.  In this case we must first add our account to
another group which has permission to set the relevant labels (it doesn't
get that simply by being an administrator), and then do the
commenting/voting/submitting, followed by cleaning up the extra group
membership again at the end::

  ssh -p 29418 myname.admin@review.opendev.org \
    "gerrit set-members 'Project Bootstrappers' --add myname.admin"

  ssh -p 29418 myname.admin@review.opendev.org \
    "gerrit review 12345,6 --message 'Bypassing Zuul to merge this.' \
    --code-review=2 --verified=2 --label workflow=1 --submit"

  ssh -p 29418 myname.admin@review.opendev.org \
    "gerrit set-members 'Project Bootstrappers' --remove myname.admin"

Note that it's possible to temporarily add your normal OpenID-associated
WebUI account to the ``Administrators`` group or other groups with similar
superuser permissions like ``Project Bootstrappers``, but keep in mind that
an attacker who has quietly gained control of your account at the OpenID
provider could be waiting for that opportunity to take advantage of the
added permissions, or you may simply forget to remove the account afterward,
negating the added safety of this account separation.

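One way to guard against a forgotten membership is to list the group
once the work is done.  The sketch below assumes the ``myname.admin``
account from the examples above and uses Gerrit's ``ls-members`` SSH
command; the account should no longer appear in the output::

  ssh -p 29418 myname.admin@review.opendev.org \
    "gerrit ls-members 'Project Bootstrappers'"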

For more examples, see the detailed documentation for Gerrit's SSH CLI,
available on our server:
https://review.opendev.org/Documentation/cmd-index.html

GitHub Access
=============

To ensure that code review and testing are not bypassed in the public
Git repositories, only Gerrit will be permitted to commit code to
OpenDev repositories.  Because GitHub always allows project
administrators to commit code, accounts that have access to manage the
GitHub projects necessarily will have commit access to the
repositories.

A shared GitHub administrative account is available (credentials
stored in the global authentication location).  If administrators
would prefer to keep a separate account, it can be added to the
organisation after discussion and noting the caveats around elevated
access.  The account must have 2FA enabled.

In either case, the administrator accounts should not be used to check
out or commit code for any project.

Note that it is unlikely to be useful to use an account also used for
active development, as you will be subscribed to many notifications
for all projects.

Root only information
#####################

Below is information relevant to members of the core team with root
access.

Accessing Clouds
================

As an unprivileged user who is a member of the ``sudo`` group on bridge,
you can inspect any of the clouds with::

  sudo openstack --os-cloud <cloud name> --os-region-name <region name>

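For instance, taking the cloud and region names that appear in the
Cinder examples later on this page, and assuming the OpenStack
client's standard ``--os-region-name`` option, a server listing might
look like::

  sudo openstack --os-cloud openstackci-rax --os-region-name DFW server list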

Backups
=======

Infra uses the `borg <https://borgbackup.readthedocs.io>`__ backup
tool.

Hosts in the ``borg-backup`` Ansible inventory group will be backed up
to servers in the ``borg-backup-server`` group with ``borg``.  The
``playbooks/roles/borg-backup`` and
``playbooks/roles/borg-backup-server`` roles implement the required
setup.

The backup server has a unique Unix user for each host to be backed
up.  The roles will set up the required users, their home directories in
the backup volume and relevant ``authorized_keys``.

Host backup happens via a daily cron job (managed by Ansible) on each
individual host to be backed up.  The host to be backed up initiates
the backup process to the remote backup server(s) using a separate SSH
key set up just for backup communication (see ``/root/.ssh/config``).

Setting up hosts for backup
---------------------------

To set up a host for backup, put it in the ``borg-backup`` group.

Hosts can specify ``borg_backup_excludes_extra`` and
``borg_backup_dirs_extra`` to exclude or include specific directories
as required (see role documentation for more details).

``borg`` splits backup data into chunks and de-duplicates as much as
possible.  For backing up large items, particularly things like
database dumps, we want to give ``borg`` as much chance to
de-duplicate as possible.  Approaches such as dumping to compressed
files on disk defeat de-duplication because all the data changes for
each dump.

For dumping large data, hosts should put a file into
``/etc/borg-streams`` that performs the dump in an uncompressed manner
to stdout.  The backup scripts will create a separate archive for each
stream defined here.  For more details, see the ``backup`` role
documentation.  These streams should attempt to be as friendly to
de-duplication as possible; see some of the examples of ``mysqldump``
to find arguments that help keep the output data more stable (and
hence more easily de-duplicated).

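As an illustration only, a stream file for a MySQL database might look
like the sketch below.  The file name and the particular ``mysqldump``
options are assumptions chosen to keep successive dumps similar; check
the role documentation for the conventions the backup scripts actually
expect::

  #!/bin/sh
  # Hypothetical /etc/borg-streams/mysql
  # Write an uncompressed dump to stdout so borg can de-duplicate it.
  #   --skip-extended-insert  one INSERT per row, so a changed row only
  #                           alters a small part of the output
  #   --skip-dump-date        omit the completion timestamp, which would
  #                           otherwise differ on every run
  exec mysqldump --all-databases --single-transaction \
    --skip-extended-insert --skip-dump-date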

Restore from Backup
-------------------

Hosts have ``/usr/local/bin/borg-mount`` (specify one of the backup
servers as an argument) that will mount the backups to
``/opt/backups`` via FUSE.

``borg`` has other options for restoring.  If you need to extract on
the backup server itself, a basic way to dump a host at a particular
time is to:

* log into the backup server
* use ``sudo su -`` to switch to the backup user for the host to be restored
* you will now be in the home directory of that user
* run ``/opt/borg/bin/borg list ./backup`` to list the archives available
* these should look like ``<hostname>-<stream>-YYYY-MM-DDTHH:MM:SS``
* move to a working directory
* extract one of the appropriate archives with
  ``/opt/borg/bin/borg extract ~/backup::<archive-tag>``

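Because the archive names end in an ISO 8601 timestamp, the newest
archive for a given stream can be picked out with ordinary text tools.
The helper below is a sketch; the function name, host name and stream
names are made up for illustration::

  # Print the newest archive name for one stream.
  # Usage: latest_archive STREAM < list-of-archive-names
  latest_archive() {
      # ISO 8601 timestamps sort correctly as plain text
      grep -- "-$1-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T" \
          | sort | tail -n 1
  }

  # Example with made-up names standing in for real archive names
  printf '%s\n' \
      'host01-filesystem-2021-01-01T05:17:02' \
      'host01-mysql-2021-01-03T05:17:30' \
      'host01-filesystem-2021-01-02T05:17:04' \
      | latest_archive filesystem

The example prints ``host01-filesystem-2021-01-02T05:17:04``.  On the
backup server the input could come from
``/opt/borg/bin/borg list --short ./backup``, which prints one archive
name per line.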

Managing backup storage
-----------------------

We run ``borg`` in append-only mode.  This means clients cannot
remove old backups on the server.

However, due to the way borg works, append-only mode plays all client
transactions into a transaction log until a read-write operation
occurs.  When examined, the repository will appear to have all these
transactions applied (e.g. pruned archives will not appear, even though
they have not actually been pruned from disk).  If you have reason to
not trust the state of the backup, you should *not* run any read-write
operations.  You will need to manually examine the transaction log and
roll back to a known good state; see
`<https://borgbackup.readthedocs.io/en/stable/usage/notes.html#append-only-mode>`__.

Because we have limited backup space, each backup server has a
script ``/usr/local/bin/prune-borg-backups`` which can be run to
reclaim space.  This should be run in a ``screen`` instance as it can
take a considerable time.  It will prompt when run; you can first
check what would happen with a ``noop`` run, and confirming the prune
will log the output to ``/opt/backups``.  This will keep the last 7 days
of backups, then monthly backups for 1 year and yearly backups for each
archive.  The backup servers will send a warning when backup volume
usage is high, at which point this can be run manually.

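For orientation, the retention policy described above corresponds
roughly to the ``borg prune`` options below.  This is a sketch rather
than the contents of ``prune-borg-backups``: the repository path is
taken from the restore steps, treating yearly archives as unlimited is
an assumption, and ``--dry-run`` makes ``borg`` report what it would
do without changing the repository::

  # Keep 7 daily archives, 12 monthly archives and (assumed) every
  # yearly archive; a negative count means "no limit" to borg
  /opt/borg/bin/borg prune --dry-run --list \
    --keep-daily 7 --keep-monthly 12 --keep-yearly -1 \
    ./backup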

.. _force-merging-a-change:

Force-Merging a Change
======================

Occasionally it is necessary to bypass the CI system and merge a
change directly.  Usually, this is only required if we have a hole in
our testing of the CI or related systems themselves and have merged a
change which causes them to be unable to operate normally and
therefore unable to merge a reversion of the problematic change.  In
these cases, use the following procedure to force-merge a change.

* Add yourself to the *Project Bootstrappers* group in Gerrit.

* Navigate to the change which needs to be merged and reload the page.

* Remove any -2 votes on the change.

* Add +2 Code-Review, and +1 Workflow votes if necessary, then add +2
  Verified.  Also leave a review comment briefly explaining why this
  was necessary, and make sure to mention it in the #opendev
  IRC channel (ideally as a #status log entry for the benefit of
  those not paying close attention to scrollback).

* At this point, a *Submit* button should appear; click it.  The
  change should now be merged.

* Remove yourself from *Project Bootstrappers*.

This procedure is the safest way to force-merge a change, ensuring
that all of the normal steps that Gerrit performs on repos still
happen.

Launching New Servers
=====================

New servers are launched using the ``launch/launch-node.py`` tool from the git
repository ``https://opendev.org/opendev/system-config``.  This
tool is run from a checkout on the bridge; please see
:git_file:`launch/README.rst` for detailed instructions.

.. _disable-enable-ansible:

Disable/Enable Ansible
======================

You should normally not make manual changes to servers, but instead
make changes through Ansible or Puppet.  However, under some circumstances,
you may need to temporarily make a manual change to a managed
resource on a server.

OpenDev uses a static inventory in Ansible to control execution of Ansible
on hosts.  A full understanding of the concepts in
`Ansible Inventory Introduction
<http://docs.ansible.com/ansible/intro_inventory.html>`_
is essential for being able to make informed decisions about actions
to take.

If you need to disable the running of Ansible or Puppet on a node,
add an entry to the Ansible inventory ``disabled`` group
in :git_file:`inventory/groups.yaml`.  The
disabled entry is an input to ``ansible --list-hosts``, so you can check your
entry simply by running ``ansible $hostlist --list-hosts`` as root
on the bridge host and ensuring that the list of hosts returned is as
expected.  Globs, group names and server UUIDs should all be acceptable input.

If you need to disable a host immediately without waiting for a patch to land
to ``system-config``, there is a file on the bridge host,
``/etc/ansible/hosts/emergency.yaml``, that can be edited directly.

``/etc/ansible/hosts/emergency.yaml`` is a file that should normally be empty,
but the contents are not managed by Ansible.  Its purpose is to allow for
disabling Ansible at times when landing a change to the Ansible repo would be
either unreasonable or impossible.

Disabling Puppet via the Ansible inventory does not prevent Puppet from being
run directly on the host; it merely prevents Ansible from
attempting to run it during the regular Zuul jobs.  If you choose to run
Puppet manually on a host, take care to ensure that it has not been disabled
at the bridge level first.

If you need to pause all execution of Ansible playbooks by Zuul, you can
run the utility script ``disable-ansible``.  The script touches the file
``/home/zuul/DISABLE-ANSIBLE`` on bridge.openstack.org.  Doing
this forces the Zuul jobs that run Ansible for us to wait until that file is
removed.  This acts like a global pause.  The script exists to prevent admins
from misspelling the name of the file and is recommended.

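The pause can be pictured as a job polling for the file before it
proceeds.  The function below is only a sketch of that idea, not the
code the Zuul jobs actually run; the function name, timeout and polling
interval are made-up defaults::

  # Block until the flag file is gone, or give up after a timeout.
  # Usage: wait_for_ansible_enabled [FLAG_FILE] [TIMEOUT_SECS] [INTERVAL_SECS]
  wait_for_ansible_enabled() {
      flag_file=${1:-/home/zuul/DISABLE-ANSIBLE}
      timeout=${2:-3600}
      interval=${3:-10}
      waited=0
      while [ -e "$flag_file" ]; do
          if [ "$waited" -ge "$timeout" ]; then
              echo "Timed out waiting for $flag_file to be removed" >&2
              return 1
          fi
          sleep "$interval"
          waited=$((waited + interval))
      done
      return 0
  }

A caller would run ``wait_for_ansible_enabled`` and start its playbook
only if the function returns success.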

Examples
--------

To disable an OpenDev instance called ``foo.opendev.org`` temporarily,
ensure the following is in ``/etc/ansible/hosts/emergency.yaml``::

  # Please add an inline comment so we know who added the host and why
  plugin: yamlgroup
  groups:
    disabled:
      - foo.opendev.org # 2020-05-23 bob is testing change 654321

Ad-hoc Ansible runs
===================

If you need to run Ansible manually against a host, you should:

* Disable automated Ansible runs following the section above.
* ``su`` to the ``zuul`` user and run the playbook with something like
  ``ansible-playbook -vv
  src/opendev.org/opendev/system-config/playbooks/service-<name>.yaml``.
* Restore automated Ansible runs.
* You can also use the ``--limit`` flag to restrict which hosts run
  when there are many in a group.  However, be aware that some
  roles/playbooks like ``letsencrypt`` and ``backup`` run across
  multiple hosts (deploying DNS records or authorization keys), so
  incorrect ``--limit`` flags could cause further failures.

.. _cinder:

Cinder Volume Management
========================

Adding a New Device
-------------------

If the main volume group doesn't have enough space for what you want
to do, this is how you can add a new volume.

Log into bridge.openstack.org and run::

  export OS_CLOUD=openstackci-rax
  export OS_REGION_NAME=DFW

  openstack server list
  openstack volume list

Change the variables to use a different environment, for example ORD::

  export OS_CLOUD=openstackci-rax
  export OS_REGION_NAME=ORD

* Add a new 1024G cinder volume (substitute the hostname and the next number
  in series for NN)::

    openstack volume create --size 1024 "HOSTNAME.openstack.org/mainNN"
    openstack server add volume "HOSTNAME.openstack.org" "HOSTNAME.openstack.org/mainNN"

* or to add a 100G SSD volume::

    openstack volume create --type SSD --size 100 "HOSTNAME.openstack.org/mainNN"
    openstack server add volume "HOSTNAME.openstack.org" "HOSTNAME.openstack.org/mainNN"

* Then, on the host, create the partition table::

    DEVICE=/dev/xvdX
    sudo parted $DEVICE mklabel msdos mkpart primary 0% 100% set 1 lvm on
    sudo pvcreate ${DEVICE}1

* It should show up in pvs::

    $ sudo pvs
      PV         VG   Fmt  Attr PSize    PFree
      /dev/xvdX1      lvm2 a-   1024.00g 1024.00g

* Add it to the main volume group::

    sudo vgextend main ${DEVICE}1

* However, if the volume group does not exist yet, you can create it::

    sudo vgcreate main ${DEVICE}1

Creating a New Logical Volume
-----------------------------

Make sure there is enough space in the volume group::

  $ sudo vgs
    VG   #PV #LV #SN Attr   VSize VFree
    main   4   2   0 wz--n- 2.00t 347.98g

If not, see `Adding a New Device`_.

Create the new logical volume and initialize the filesystem::

  NAME=newvolumename
  sudo lvcreate -L1500GB -n $NAME main

  sudo mkfs.ext4 -m 0 -j -L $NAME /dev/main/$NAME
  sudo tune2fs -i 0 -c 0 /dev/main/$NAME

Be sure to add it to ``/etc/fstab``.

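A minimal sketch of such an ``/etc/fstab`` entry follows.  The mount
point and the ``defaults`` mount options are assumptions; adjust them
to suit the service using the volume::

  NAME=newvolumename
  MOUNTPOINT=/srv/$NAME   # hypothetical mount point
  # Fields: device, mount point, filesystem type, options, dump, fsck order
  FSTAB_LINE="/dev/main/$NAME $MOUNTPOINT ext4 defaults 0 2"
  echo "$FSTAB_LINE"

  # To apply it for real (requires root):
  #   echo "$FSTAB_LINE" | sudo tee -a /etc/fstab
  #   sudo mkdir -p "$MOUNTPOINT"
  #   sudo mount "$MOUNTPOINT"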

Expanding an Existing Logical Volume
------------------------------------

Make sure there is enough space in the volume group::

  $ sudo vgs
    VG   #PV #LV #SN Attr   VSize VFree
    main   4   2   0 wz--n- 2.00t 347.98g

If not, see `Adding a New Device`_.

The following example increases the size of a volume by 100G::

  NAME=volumename
  sudo lvextend -L+100G /dev/main/$NAME
  sudo resize2fs /dev/main/$NAME

The following example increases the size of a volume to the maximum allowable::

  NAME=volumename
  sudo lvextend -l +100%FREE /dev/main/$NAME
  sudo resize2fs /dev/main/$NAME

Replace an Existing Device
--------------------------

We generally need to do this if our cloud provider is planning maintenance on a
volume.  We usually get a few days' notice of the maintenance window, so
depending on the size of the volume, there may be little time to spare for
the replacement.

The first thing to do is add the replacement device to the server; see
`Adding a New Device`_.  Be sure the replacement volume is the same type and
size as the existing one.

If the step above was followed, you should see something like::

  $ sudo pvs
    PV         VG   Fmt  Attr PSize  PFree
    /dev/xvdb1 main lvm2 a--  50.00g     0
    /dev/xvdc1 main lvm2 a--  50.00g 50.00g

Be sure both devices are in the same VG (volume group); if not, the volume
group was not properly extended with the new device.

.. note::
   Be sure to use a screen session for the following step!

Next, move the data from one device to the other::

  $ sudo pvmove /dev/xvdb1 /dev/xvdc1
    /dev/xvdb1: Moved: 0.0%
    /dev/xvdb1: Moved: 1.8%
    ...
    ...
    /dev/xvdb1: Moved: 99.4%
    /dev/xvdb1: Moved: 100.0%

Confirm all the data was moved, and the original device is empty (PFree)::

  $ sudo pvs
    PV         VG   Fmt  Attr PSize  PFree
    /dev/xvdb1 main lvm2 a--  50.00g 50.00g
    /dev/xvdc1 main lvm2 a--  50.00g     0

Then remove the device from the main volume group::

  $ sudo vgreduce main /dev/xvdb1
    Removed "/dev/xvdb1" from volume group "main"

To be safe, we can also wipe the label from LVM::

  $ sudo pvremove /dev/xvdb1
    Labels on physical volume "/dev/xvdb1" successfully wiped

This leaves us with just a single device::

  $ sudo pvs
    PV         VG   Fmt  Attr PSize  PFree
    /dev/xvdc1 main lvm2 a--  50.00g     0

At this point, you can remove the original volume from OpenStack if it is
no longer needed.

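A sketch of that final clean-up, run from the bridge with the same
``OS_CLOUD`` and ``OS_REGION_NAME`` environment as in
`Adding a New Device`_.  ``HOSTNAME`` and ``mainNN`` are placeholders
for the server and the old volume::

  # Detach the old, now-empty volume from the server
  openstack server remove volume "HOSTNAME.openstack.org" "HOSTNAME.openstack.org/mainNN"

  # Delete it once you are sure it is no longer needed
  openstack volume delete "HOSTNAME.openstack.org/mainNN"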

Email
=====

There is a shared email account used for Infrastructure-related mail
(account sign-ups, support tickets, etc).  Root admins should ensure
they have access to this account; access credentials are available
from any existing member.