:title: System Administration

This page collects technical information of relevance to those
interested in the administration of OpenDev services. For a
higher-level overview, see :ref:`opendev-infra-overview`.

.. _sysadmin:

System Administration
#####################

.. _ssh-access:

SSH Access
==========

For any of the systems managed by the OpenDev Infrastructure team, the
following practices must be observed for SSH access:

* SSH access is only permitted with SSH public/private key
  authentication.
* Users must use a strong passphrase to protect their private key. A
  passphrase of several words, at least one of which is not in a
  dictionary, is advised, or a random string of at least 16
  characters.
* To mitigate the inconvenience of using a long passphrase, users may
  want to use an SSH agent so that the passphrase is only requested
  once per desktop session.
* Users' private keys must never be stored anywhere except their own
  workstation(s). In particular, they must never be stored on any
  remote server.
* If users need to 'hop' from a server or bastion host to another
  machine, they must not copy a private key to the intermediate
  machine (see above). Instead, SSH agent forwarding may be used.
  However, due to the potential for a compromised intermediate machine
  to ask the agent to sign requests without the user's knowledge, in
  this case only an SSH agent that interactively prompts the user each
  time a signing request is received (i.e. ssh-agent, but not
  gnome-keyring) should be used, and the SSH keys should be added with
  the confirmation constraint (``ssh-add -c``; see the example below
  this list).
* The number of SSH keys that are configured to permit access to
  OpenDev machines should be kept to a minimum.
* OpenDev Infrastructure machines must use Ansible to centrally manage
  and configure user accounts and the SSH authorized_keys files from
  the opendev/system-config repository.
* SSH keys should be periodically rotated (at least once per year).
  During rotation, a new key can be added to the configuration for a
  time, and then the old one removed.

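For example, a key can be loaded into a prompting agent with the
confirmation constraint like this (a minimal sketch; the key path is
whatever your actual key is):

.. code-block:: shell-session

   $ eval "$(ssh-agent)"
   $ ssh-add -c ~/.ssh/id_ed25519
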
Gerrit Admins
=============

To provide a reasonable firewall from outside authentication systems,
Gerrit administrators keep two accounts: one for normal code review
activity and one for performing Gerrit administration. Following the
same pattern as our Kerberos administrator account logins, the admin
account corresponding to ``$USER`` would be ``$USER.admin`` (Gerrit
doesn't allow ``/`` in usernames) so that admin accounts can be easily
identified when auditing activity. Unlike the normal code review
account, the admin account should have no OpenID, so that it is only
accessible by API/CLI methods and cannot be compromised at the
third-party ID provider.

To create a personal Gerrit admin account from a shell on the server,
run the following command:

.. code-block:: shell-session

   $ sudo -u gerrit2 ssh -i ~gerrit2/review_site/etc/ssh_host_rsa_key -p 29418 -l 'Gerrit Code Review' localhost "suexec --as openstack-project-creator -- gerrit create-account --group Administrators --full-name myname.admin --ssh-key 'ssh-rsa AAAA...BCDE myname@computer' myname.admin"

We ``suexec`` as the ``openstack-project-creator`` account because the
magic ``Gerrit Code Review`` pseudo-account can't set group
memberships, so we need to run that command as a user which is already
in the ``Administrators`` group. With an account like this, routine
actions like populating new groups with initial members are still
quite simple:

.. code-block:: shell-session

   $ ssh -p 29418 myname.admin@review.opendev.org "gerrit set-members some-new-group --add somebody@example.org"

Another common example is bypassing Zuul to submit a change for
merging directly to a project. See :ref:`force-merging-a-change` for
details.

GitHub Access
=============

To ensure that code review and testing are not bypassed in the public
Git repositories, only Gerrit is permitted to commit code to OpenDev
repositories. Because GitHub always allows project administrators to
commit code, accounts that have access to manage the GitHub projects
necessarily have commit access to the repositories.

A shared GitHub administrative account is available (credentials
stored in the global authentication location). If administrators
would prefer to keep a separate account, it can be added to the
organisation after discussion and noting the caveats around elevated
access. The account must have 2FA enabled.

In either case, the administrator accounts should not be used to check
out or commit code for any project.

Note that it is unlikely to be useful to use an account also used for
active development, as you will be subscribed to many notifications
for all projects.

Root only information
#####################

Below is information relevant to members of the core team with root
access.

Accessing Clouds
================

As an unprivileged user who is a member of the ``sudo`` group on
bridge, you can inspect any of the clouds with:

.. code-block:: shell-session

   $ sudo openstack --os-cloud <cloud name> --os-region-name <region name>

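For instance, using the cloud and region names that appear in the
Cinder section later on this page:

.. code-block:: shell-session

   $ sudo openstack --os-cloud openstackci-rax --os-region-name DFW server list
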
Backups
=======

Infra uses the `borg <https://borgbackup.readthedocs.io>`__ backup
tool.

Hosts in the ``borg-backup`` Ansible inventory group will be backed up
to servers in the ``borg-backup-server`` group with ``borg``. The
``playbooks/roles/borg-backup`` and
``playbooks/roles/borg-backup-server`` roles implement the required
setup.

The backup server has a unique Unix user for each host to be backed
up. The roles will set up the required users, their home directories
in the backup volume and the relevant ``authorized_keys``.

Host backups happen via a daily cron job (managed by Ansible) on each
individual host to be backed up. The host to be backed up initiates
the backup process to the remote backup server(s) using a separate SSH
key set up just for backup communication (see ``/root/.ssh/config``).

Setting up hosts for backup
---------------------------

To set up a host for backup, put it in the ``borg-backup`` group.

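Following the inventory group pattern shown in the Ansible examples
later on this page, this is a sketch of what the relevant entry in
:git_file:`inventory/groups.yaml` might look like (host name
illustrative)::

   groups:
     borg-backup:
       - foo01.opendev.org
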
Hosts can specify ``borg_backup_excludes_extra`` and
``borg_backup_dirs_extra`` to exclude or include specific directories
as required (see the role documentation for more details).

``borg`` splits backup data into chunks and de-duplicates as much as
possible. For backing up large items, particularly things like
database dumps, we want to give ``borg`` as much chance to
de-duplicate as possible. Approaches such as dumping to compressed
files on disk defeat de-duplication because all the data changes for
each dump.

For dumping large data, hosts should put a file into
``/etc/borg-streams`` that performs the dump in an uncompressed manner
to stdout. The backup scripts will create a separate archive for each
stream defined there. For more details, see the ``backup`` role
documentation. These streams should attempt to be as friendly to
de-duplication as possible; see some of the examples of ``mysqldump``
to find arguments that help keep the output data more stable (and
hence more easily de-duplicated).

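As an illustration, a stream file for a database dump might look like
the following sketch (the file name and exact ``mysqldump`` arguments
are illustrative; check the role documentation for the precise
conventions):

.. code-block:: shell

   #!/bin/bash
   # /etc/borg-streams/mysql -- dump all databases, uncompressed, to stdout.
   # --skip-extended-insert writes one row per INSERT statement, so a small
   # row change only touches a few lines of output, keeping the data stable
   # and friendly to borg's de-duplication.
   /usr/bin/mysqldump --defaults-file=/root/.my.cnf --skip-extended-insert \
       --single-transaction --all-databases
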
Restore from Backup
-------------------

Hosts have ``/usr/local/bin/borg-mount`` (specify one of the backup
servers as an argument) that will mount the backups to
``/opt/backups`` via FUSE.

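For example (the backup server name here is illustrative):

.. code-block:: shell-session

   $ sudo /usr/local/bin/borg-mount backup01.opendev.org
   $ ls /opt/backups
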
``borg`` has other options for restoring. If you need to extract on
the backup server itself, a basic way to dump a host at a particular
time is to:

* log into the backup server
* ``sudo su -`` to switch to the backup user for the host to be restored
* you will now be in the home directory of that user
* run ``/opt/borg/bin/borg list ./backup`` to list the archives available

  * these should look like ``<hostname>-<stream>-YYYY-MM-DDTHH:MM:SS``

* move to a working directory
* extract one of the appropriate archives with
  ``/opt/borg/bin/borg extract ~/backup <archive-tag>``

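Put together, a restore session on the backup server might look like
this sketch (the backup user name, archive tag and paths are all
illustrative):

.. code-block:: shell-session

   $ sudo su - borg-review01
   $ /opt/borg/bin/borg list ./backup
   $ mkdir /tmp/restore && cd /tmp/restore
   $ /opt/borg/bin/borg extract ~/backup review01-mysql-2023-03-01T04:21:03
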
Managing backup storage
-----------------------

We run ``borg`` in append-only mode. This means clients can not
remove old backups on the server.

However, due to the way borg works, append-only mode appends all
client transactions to a transaction log until a read-write operation
occurs. Examining the repository will show all these transactions as
applied (e.g. pruned archives will not appear, even though they have
not actually been removed from disk). If you have reason to not trust
the state of the backup, you should *not* run any read-write
operations. You will need to manually examine the transaction log and
roll back to a known good state; see
`<https://borgbackup.readthedocs.io/en/stable/usage/notes.html#append-only-mode>`__.

However, we have limited backup space. Each backup server has a
script, ``/usr/local/bin/prune-borg-backups``, which can be run to
reclaim space. This should be run in a ``screen`` instance as it can
take considerable time. It will prompt when run; you can check the
process with a ``noop`` run, while confirming the prune will log the
output to ``/opt/backups``. This keeps the last 7 days of backups,
then monthly backups for 1 year, and yearly backups for each archive.
The backup servers will send a warning when backup volume usage is
high, at which point this can be run manually.

.. _force-merging-a-change:

Force-Merging a Change
======================

Occasionally it is necessary to bypass the CI system and merge a
change directly. Usually, this is only required if we have a hole in
our testing of the CI or related systems themselves and have merged a
change which causes them to be unable to operate normally and
therefore unable to merge a reversion of the problematic change. In
these cases, use the following procedure to force-merge a change.

* Add yourself to the *Project Bootstrappers* group in Gerrit:

  .. code-block:: shell-session

     $ ssh -p 29418 myname.admin@review.opendev.org \
       "gerrit set-members 'Project Bootstrappers' --add myname.admin"

* Changes with Code-Review -2, Verified -2, or Workflow -1 votes cannot
  merge. If the change has any of these votes you will need to remove
  them first. We can do that via SSH by removing users with those votes
  from the reviewer list:

  .. code-block:: shell-session

     $ ssh -p 29418 myname.admin@review.opendev.org \
       "gerrit set-reviewers --project foo/bar --remove $USER_WITH_VOTE 123456"

* To merge, the change needs a Code-Review +2, Verified +2, and
  Workflow +1. We will apply those votes and ask Gerrit to submit
  (merge) the change using a single ``gerrit review`` command:

  .. code-block:: shell-session

     $ ssh -p 29418 myname.admin@review.opendev.org \
       "gerrit review 12345,6 --message 'Bypassing Zuul to merge this.' \
       --code-review=2 --verified=2 --label workflow=1 --submit"

  Please edit the message argument to provide as much detail as
  possible for why the normal processes were bypassed in this
  situation.

* Remove yourself from *Project Bootstrappers*:

  .. code-block:: shell-session

     $ ssh -p 29418 myname.admin@review.opendev.org \
       "gerrit set-members 'Project Bootstrappers' --remove myname.admin"

This procedure is the safest way to force-merge a change, ensuring
that all of the normal steps that Gerrit performs on repos still
happen.

Note that it is possible to temporarily add your normal
OpenID-associated WebUI account to the ``Administrators`` group or
other groups with similar superuser permissions like ``Project
Bootstrappers``, but keep in mind that an attacker who has quietly
gained control of your account at the OpenID provider could be waiting
for that opportunity to take advantage of the added permissions, or
you may simply forget to remove the account afterward, negating the
added safety of this account separation.

For more examples, see the detailed documentation for Gerrit's SSH
CLI, available on our server:
https://review.opendev.org/Documentation/cmd-index.html

Launching New Servers
=====================

New servers are launched using the ``launch/launch-node.py`` tool from
the git repository ``https://opendev.org/opendev/system-config``. This
tool is run from a checkout on the bridge - please see
:git_file:`launch/README.rst` for detailed instructions.

.. _disable-enable-ansible:

Disable/Enable Ansible
======================

You should normally not make manual changes to servers, but instead
make changes through Ansible or Puppet. However, under some
circumstances, you may need to temporarily make a manual change to a
managed resource on a server.

OpenDev uses a static inventory in Ansible to control execution of
Ansible on hosts. A full understanding of the concepts in the
`Ansible Inventory Introduction
<http://docs.ansible.com/ansible/intro_inventory.html>`_
is essential for being able to make informed decisions about actions
to take.

In the case of needing to disable the running of Ansible or Puppet on
a node, it's a simple matter of adding an entry to the Ansible
inventory "disabled" group in :git_file:`inventory/groups.yaml`. The
disabled entry is an input to ``ansible --list-hosts``, so you can
check your entry simply by running ``ansible $hostlist --list-hosts``
as root on the bridge host and ensuring that the list of hosts
returned is as expected. Globs, group names and server UUIDs should
all be acceptable input.

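For example, to check what an entry matches before relying on it
(host names illustrative):

.. code-block:: shell-session

   # ansible 'foo*.opendev.org' --list-hosts
     hosts (1):
       foo01.opendev.org
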
If you need to disable a host immediately, without waiting for a patch
to land to ``system-config``, there is a file on the bridge host,
``/etc/ansible/hosts/emergency.yaml``, that can be edited directly.

``/etc/ansible/hosts/emergency.yaml`` is a file that should normally
be empty, but its contents are not managed by Ansible. Its purpose is
to allow for disabling Ansible at times when landing a change to the
Ansible repo would be either unreasonable or impossible.

Disabling Puppet via the Ansible inventory does not prevent Puppet
from being run directly on the host; it merely prevents Ansible from
attempting to run it during the regular Zuul jobs. If you choose to
run Puppet manually on a host, take care to ensure that it has not
been disabled at the bridge level first.

If you need to pause all execution of Ansible playbooks by Zuul, you
can run the utility script ``disable-ansible``. The script touches
the file ``/home/zuul/DISABLE-ANSIBLE`` on bridge.openstack.org.
Doing this forces the Zuul jobs that run Ansible for us to wait until
that file is removed, acting like a global pause. The script exists
to prevent admins from misspelling the name of the file, and its use
is recommended.

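A typical pause/resume cycle is therefore a sketch like the following
(run as root on the bridge host):

.. code-block:: shell-session

   root@bridge:~# disable-ansible
   (... perform the work that required the pause ...)
   root@bridge:~# rm /home/zuul/DISABLE-ANSIBLE
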
Examples
--------

To disable an OpenDev instance called ``foo.opendev.org`` temporarily,
ensure the following is in ``/etc/ansible/hosts/emergency.yaml``:

::

   # Please add an inline comment so we know who added the host and why
   plugin: yamlgroup
   groups:
     disabled:
       - foo.opendev.org  # 2020-05-23 bob is testing change 654321

Ad-hoc Ansible runs
===================

If you need to run Ansible manually against a host, you should:

* disable automated Ansible runs following the section above
* ``su`` to the ``zuul`` user and run the playbook with something like
  ``ansible-playbook -vv
  src/opendev.org/opendev/system-config/playbooks/service-<name>.yaml``
* restore automated Ansible runs
* You can also use the ``--limit`` flag to restrict which hosts run
  when there are many in a group. However, be aware that some
  roles/playbooks like ``letsencrypt`` and ``backup`` run across
  multiple hosts (deploying DNS records or authorization keys), so
  incorrect ``--limit`` flags could cause further failures (see the
  example after this list).

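For example, a limited ad-hoc run might look like this sketch (the
playbook and host name are illustrative):

.. code-block:: shell-session

   $ sudo su - zuul
   $ ansible-playbook -vv \
       src/opendev.org/opendev/system-config/playbooks/service-nodepool.yaml \
       --limit nl01.opendev.org
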
.. _cinder:

Cinder Volume Management
========================

Adding a New Device
-------------------

If the main volume group doesn't have enough space for what you want
to do, this is how you can add a new volume.

Log into bridge.openstack.org and run::

   export OS_CLOUD=openstackci-rax
   export OS_REGION_NAME=DFW

   openstack server list
   openstack volume list

Change the variables to use a different environment, ORD for example::

   export OS_CLOUD=openstackci-rax
   export OS_REGION_NAME=ORD

* Add a new 1024G cinder volume (substitute the hostname and the next
  number in series for NN)::

     openstack volume create --size 1024 "HOSTNAME.ord.openstack.org/mainNN"
     openstack server add volume "HOSTNAME.openstack.org" "HOSTNAME.ord.openstack.org/mainNN"

* or to add a 100G SSD volume::

     openstack volume create --type SSD --size 100 "HOSTNAME.openstack.org/mainNN"
     openstack server add volume "HOSTNAME.openstack.org" "HOSTNAME.openstack.org/mainNN"

* Then, on the host, create the partition table::

     DEVICE=/dev/xvdX
     sudo parted $DEVICE mklabel msdos mkpart primary 0% 100% set 1 lvm on
     sudo pvcreate ${DEVICE}1

* It should show up in ``pvs``::

     $ sudo pvs
       PV         VG   Fmt  Attr PSize    PFree
       /dev/xvdX1      lvm2 a-   1024.00g 1024.00g

* Add it to the main volume group::

     sudo vgextend main ${DEVICE}1

* However, if the volume group does not exist yet, you can create it::

     sudo vgcreate main ${DEVICE}1

Creating a New Logical Volume
-----------------------------

Make sure there is enough space in the volume group::

   $ sudo vgs
     VG   #PV #LV #SN Attr   VSize VFree
     main   4   2   0 wz--n- 2.00t 347.98g

If not, see `Adding a New Device`_.

Create the new logical volume and initialize the filesystem::

   NAME=newvolumename
   sudo lvcreate -L1500GB -n $NAME main

   sudo mkfs.ext4 -m 0 -j -L $NAME /dev/main/$NAME
   sudo tune2fs -i 0 -c 0 /dev/main/$NAME

Be sure to add it to ``/etc/fstab``.

Expanding an Existing Logical Volume
------------------------------------

Make sure there is enough space in the volume group::

   $ sudo vgs
     VG   #PV #LV #SN Attr   VSize VFree
     main   4   2   0 wz--n- 2.00t 347.98g

If not, see `Adding a New Device`_.

The following example increases the size of a volume by 100G::

   NAME=volumename
   sudo lvextend -L+100G /dev/main/$NAME
   sudo resize2fs /dev/main/$NAME

The following example increases the size of a volume to the maximum
allowable::

   NAME=volumename
   sudo lvextend -l +100%FREE /dev/main/$NAME
   sudo resize2fs /dev/main/$NAME

Replace an Existing Device
--------------------------

We generally need to do this if our cloud provider is planning
maintenance to a volume. We usually get a few days' heads-up on the
maintenance window, so depending on the size of the volume, it may
take some time to replace.

The first thing to do is add the replacement device to the server; see
`Adding a New Device`_. Be sure the replacement volume is the same
type / size as the existing one.

If the steps above were followed, you should see something like::

   $ sudo pvs
     PV         VG   Fmt  Attr PSize  PFree
     /dev/xvdb1 main lvm2 a--  50.00g      0
     /dev/xvdc1 main lvm2 a--  50.00g 50.00g

Be sure both devices are in the same VG (volume group); if not, you
did not properly extend the device.

.. note::
   Be sure to use a screen session for the following step!

Next is to move the data from one device to another::

   $ sudo pvmove /dev/xvdb1 /dev/xvdc1
     /dev/xvdb1: Moved: 0.0%
     /dev/xvdb1: Moved: 1.8%
     ...
     /dev/xvdb1: Moved: 99.4%
     /dev/xvdb1: Moved: 100.0%

Confirm all the data was moved, and the original device is empty
(PFree)::

   $ sudo pvs
     PV         VG   Fmt  Attr PSize  PFree
     /dev/xvdb1 main lvm2 a--  50.00g 50.00g
     /dev/xvdc1 main lvm2 a--  50.00g      0

And remove the device from the main volume group::

   $ sudo vgreduce main /dev/xvdb1
     Removed "/dev/xvdb1" from volume group "main"

To be safe, we can also wipe the label from LVM::

   $ sudo pvremove /dev/xvdb1
     Labels on physical volume "/dev/xvdb1" successfully wiped

Leaving us with just a single device::

   $ sudo pvs
     PV         VG   Fmt  Attr PSize  PFree
     /dev/xvdc1 main lvm2 a--  50.00g      0

At this time, you are able to remove the original volume from
OpenStack if it is no longer needed.

Email
=====

There is a shared email account used for Infrastructure related mail
(account sign-ups, support tickets, etc.). Root admins should ensure
they have access to this account; access credentials are available
from any existing member.