Clark Boylan b712584e53 Add AFS maintenance docs
This adds documentation on how to maintenance on the AFS cluster with no
service outages.

Change-Id: Idf9ab67603a1c5e8ac062458f3d17399d807e3a8
2017-04-14 11:16:35 -07:00

18 KiB

title

OpenAFS

OpenAFS

The Andrew Filesystem (or AFS) is a global distributed filesystem. With a single mountpoint, clients can access any site on the Internet which is running AFS as if it were a local filesystem.

OpenAFS is an open source implementation of the AFS services and utilities.

A collection of AFS servers and volumes that are collectively administered within a site is called a cell. The OpenStack project runs the openstack.org AFS cell, accessible at /afs/openstack.org/.

At a Glance

Hosts
  • afsdb01.openstack.org (a vldb and pts server in DFW)
  • afsdb02.openstack.org (a vldb and pts server in ORD)
  • afs01.dfw.openstack.org (a fileserver in DFW)
  • afs02.dfw.openstack.org (a second fileserver in DFW)
  • afs01.ord.openstack.org (a fileserver in ORD)
Puppet
Projects
Bugs
Resources

OpenStack Cell

AFS may be one of the most thoroughly documented systems in the world. There is plenty of very good information about how AFS works and the commands to use it. This document will only cover the mininmum needed to understand our deployment of it.

OpenStack runs an AFS cell called openstack.org. There are three important services provided by a cell: the volume location database (VLDB), the protection database (PTS), and the file server (FS). The volume location service answers queries from clients about which fileservers should be contacted to access particular volumes, while the protection service provides information about users and groups.

Our implementation follows the common recommendation to colocate the VLDB and PTS servers, and so they both run on our afsdb* servers. These servers all have the same information and communicate with each other to keep in sync and automatically provide high-availability service. For that reason, one of our DB servers is in the DFW region, and the other in ORD.

Fileservers contain volumes, each of which is a portion of the file space provided by that cell. A volume appears as at least one directory, but may contain directories within the volume. Volumes are mounted within other volumes to construct the filesystem hierarchy of the cell.

OpenStack has two fileservers in DFW and one in ORD. They do not automatically contain copies of the same data. A read-write volume in AFS can only exist on exactly one fileserver, and if that fileserver is out of service, the volumes it serves are not available. However, volumes may have read-write copies which are stored on other fileservers. If a client requests a read-only volume, as long as one site with a read-only volume is online, it will be available.

Client Configuration

To use OpenAFS on a Debian or Ubuntu machine:

sudo apt-get install openafs-client openafs-krb5 krb5-user

Debconf will ask you for a default realm, cell and cache size. Answer:

Default Kerberos version 5 realm: OPENSTACK.ORG
AFS cell this workstation belongs to: openstack.org
Size of AFS cache in kB: 500000

The default cache size in debconf is 50000 (50MB) which is not very large. We recommend setting it to 500000 (500MB -- add a zero to the default debconf value), or whatever is appropriate for your system.

The OpenAFS client is not started by default, so you will need to run:

sudo service openafs-client start

When it's done, you should be able to cd /afs/openstack.org.

Most of what is in our AFS cell does not require authentication. However, if you have a principal in kerberos, you can get an authentication token for use with AFS with:

kinit
aklog

Administration

The following information is relevant to AFS administrators.

All of these commands have excellent manpages which can be accessed with commands like man vos or man vos create. They also provide short help messages when run like vos -help or vos create -help.

For all administrative commands, you may either run them from any AFS client machine while authenticated as an AFS admin, or locally without authentication on an AFS server machine by appending the -localauth flag to the end of the command.

Adding a User

First, add a kerberos principal as described in addprinc. Have the username and UID from puppet ready.

Then add the user to the protection database with:

pts createuser $USERNAME -id UID

Admin UIDs start at 1 and increment. If you are adding a new admin user, you must run pts listentries, find the highest UID for an admin user, increment it by one and use that as the UID. The username for an admin user should be in the form username.admin.

Note

Any '/' characters in a kerberos principal become '.' characters in AFS.

Adding a Superuser

Run the following commands to add an existing principal to AFS as a superuser:

bos adduser -server afsdb01.openstack.org -user $USERNAME.admin
bos adduser -server afsdb02.openstack.org -user $USERNAME.admin
bos adduser -server afs01.dfw.openstack.org -user $USERNAME.admin
bos adduser -server afs02.dfw.openstack.org -user $USERNAME.admin
bos adduser -server afs01.ord.openstack.org -user $USERNAME.admin
pts adduser -user $USERNAME.admin -group system:administrators

Creating a Volume

Select a fileserver for the read-write copy of the volume according to which region you wish to locate it after ensuring it has sufficient free space. Then run:

vos create $FILESERVER a $VOLUMENAME

The a in the preceding command tells it to place the volume on partition vicepa. Our fileservers only have one partition and therefore this is a constant.

Be sure to mount the read-write volume in AFS with:

fs mkmount /afs/.openstack.org/path/to/mountpoint $VOLUMENAME

You may want to create read-only sites for the volume with vos addsite and then vos release.

You should set the volume quota with fs setquota.

Adding a Fileserver

Put the machine's public IP on a single line in /var/lib/openafs/local/NetInfo (TODO: puppet this).

Copy /etc/openafs/server/* from an existing fileserver.

Create an LVM volume named vicepa from cinder volumes. See cinder for details on volume management. Then run:

mkdir /vicepa
echo "/dev/main/vicepa  /vicepa ext4  errors=remount-ro,barrier=0  0  2" >>/etc/fstab
mount -a

Finally, create the fileserver with:

bos create NEWSERVER dafs dafs \
  -cmd "/usr/lib/openafs/dafileserver -p 23 -busyat 600 -rxpck 400 -s 1200 -l  1200 -cb 65535 -b 240 -vc 1200" \
  -cmd /usr/lib/openafs/davolserver \
  -cmd /usr/lib/openafs/salvageserver \
  -cmd /usr/lib/openafs/dasalvager

Mirrors

We host mirrors in AFS so that we store only one copy of the data, but mirror servers local to each cloud region in which we operate serve that data to nearby hosts from their local cache.

All of our mirrors are housed under /afs/openstack.org/mirror. Each mirror is on its own volume, and each with a read-only replica. This allows mirrors to be updated and then the read-only replicas atomically updated. Because mirrors are typically very large and replication across regions is slow, we place both copies of mirror data on two fileservers in the same region. This allows us to perform maintenance on fileservers hosting mirror data as well deal with outages related to a single server, but does not protect the mirror system from a region-wide outage.

In order to establish a new mirror, do the following:

  • The following commands need to be run authenticated on a host with kerberos and AFS setup (see afs_client; admins can run the commands on mirror-update.openstack.org). Firstly kinit and aklog to get tokens.

  • Create the mirror volume. See Creating a Volume for details. The volume should be named mirror.foo, where foo is descriptive of the contents of the mirror. Example:

    vos create afs01.dfw.openstack.org a mirror.foo
  • Create read-only replicas of the volume. One replica should be located on the same fileserver (it will take little to no additional space), and at least one other replica on a different fileserver. Example:

    vos addsite afs01.dfw.openstack.org a mirror.foo
    vos addsite afs02.dfw.openstack.org a mirror.foo
  • Release the read-only replicas:

    vos release mirror.foo

    See the status of all volumes with:

    vos listvldb

When traversing from a read-only volume to another volume across a mountpoint, AFS will first attempt to use a read-only replica of the destination volume if one exists. In order to naturally cause clients to prefer our read-only paths for mirrors, the entire path up to that point is composed of read-only volumes:

/afs             [root.afs]
  /openstack.org [root.cell]
    /mirror      [mirror]
      /bar       [mirror.bar]

In order to mount the mirror.foo volume under mirror we need to modify the read-write version of the mirror volume. To make this easy, the read-write version of the cell root is mounted at /afs/.openstack.org. Folllowing the same logic from earlier, traversing to paths below that mount point will generally prefer read-write volumes.

  • Mount the volume into afs using the read-write path:

    fs mkmount /afs/.openstack.org/mirror/foo mirror.foo
  • Release the mirror volume so that the (currently empty) foo mirror itself appears in directory listings under /afs/openstack.org/mirror:

    vos release mirror
  • Create a principal for the mirror update process. See addprinc for details. The principal should be called service/foo-mirror. Example:

    kadmin: addprinc -randkey service/foo-mirror@OPENSTACK.ORG
    kadmin: ktadd -k /path/to/foo.keytab service/foo-mirror@OPENSTACK.ORG
  • Add the service principal's keytab to hiera. Copy the binary key to puppetmaster.openstack.org and then use hieraedit to update the files

    root@puppetmaster:~# /opt/system-config/production/tools/hieraedit.py \
      --yaml /etc/puppet/hieradata/production/fqdn/mirror-update.openstack.org.yaml \
      -f /path/to/foo.keytab KEYNAME

    (don't forget to git commit and save the change; you can remove the copies of the binary key too). The key will be base64 encoded in the heira database. If you need to examine it for some reason you can use base64:

    cat /path/to/foo.keytab | base64
  • Add the new key to mirror-update.openstack.org in manifests/site.pp for the mirror scripts to use during update.

  • Create an AFS user for the service principal:

    pts createuser service.foo-mirror

Because mirrors usually have a large number of directories, it is best to avoid frequent ACL changes. To this end, we grant access to the mirror directories to a group where we can easily modify group membership if our needs change.

  • Create a group to contain the service principal, and add the principal:

    pts creategroup foo-mirror
    pts adduser service.foo-mirror foo-mirror

    View users, groups, and their membership with:

    pts listentries
    pts listentries -group
    pts membership foo-mirror
  • Grant the group access to the mirror volume:

    fs setacl /afs/.openstack.org/mirror/foo foo-mirror write
  • Grant anonymous users read access:

    fs setacl /afs/.openstack.org/mirror/foo system:anyuser read
  • Set the quota on the volume (e.g., 100GB):

    fs setquota /afs/.openstack.org/mirror/foo 100000000

Because the initial replication may take more time than we allocate in our mirror update cron jobs, manually perform the first mirror update:

  • In screen, obtain the lock on mirror-update.openstack.org:

    flock -n /var/run/foo-mirror/mirror.lock bash

    Leave that running while you perform the rest of the steps.

  • Also in screen on mirror-update, run the initial mirror sync. If using one of the mirror update scripts (from /usr/local/bin) be aware that they generally run the update process under timeout with shorter periods than may be required for the initial full sync.

  • Log into afs01.dfw.openstack.org and run screen. Within that session, periodically during the sync, and once again after it is complete, run:

    vos release mirror.foo -localauth

    It is important to do this from an AFS server using -localauth rather than your own credentials and inside of screen because if vos release is interrupted, it will require some manual cleanup (data will not be corrupted, but clients will not see the new volume until it is successfully released). Additionally, vos release has a bug where it will not use renewed tokens and so token expiration during a vos release may cause a similar problem.

  • Once the initial sync and and vos release are complete, release the lock file on mirror-update.

Removing a mirror

If you need to remove a mirror, you can do the following:

  • Unmount the volume from the R/W location:

    fs rmmount /afs/.openstack.org/mirror/foo
  • Release the R/O mirror volume to reflect the changes:

    vos release mirror
  • Check what servers the volumes are on with vos listvldb:

    VLDB entries for all servers
    ...
    
    mirror.foo
        RWrite: 536870934     ROnly: 536870935
        number of sites -> 3
           server afs01.dfw.openstack.org partition /vicepa RW Site
           server afs01.dfw.openstack.org partition /vicepa RO Site
           server afs01.ord.openstack.org partition /vicepa RO Site
     ...
  • Remove the R/O replicas (you can also see these with vos listvol -server afs0[1|2].dfw.openstack.org):

    vos remove -server afs01.dfw.openstack.org -partition a -id mirror.foo.readonly
    vos remove -server afs02.dfw.openstack.org -partition a -id mirror.foo.readonly
  • Remove the R/W volume:

    vos remove -server afs02.dfw.openstack.org -partition a -id mirror.foo
Reverse Proxy Cache

Each of the region-local mirror hosts exposes a limited reverse HTTP proxy on port 8080. These proxies run within the same Apache setup as used to expose AFS mirror contents. mod_cache is used to expose a white-listed set of resources (currently just RDO).

Currently they will cache data for up to 24 hours (Apache default) with pruning performed by htcacheclean once an hour to keep the cache size at or under 2GB of disk space.

The reverse proxy is provided because there are some hosted resources that are not currently able to be practically mirrored. Examples of this include RDO (rsync from RDO is slow and they update frequently) and docker images (which require specialized software to run a docker registry and then sorting out how to run that on a shared filesystem).

Apache was chosen because we already had configuration management in place for Apache on these hosts. This avoids management overheads of a completely new service deployment such as Squid or a caching docker registry daemon.

No Outage Server Maintenance

afsdb0X.openstack.org

We have redundant AFS DB servers. You can take one down without causing a service outage as long as the other remains up. To do this safely:

root@afsdb01:~# bos shutdown afsdb01.openstack.org -wait -localauth
root@afsdb01:~# bos status afsdb01.openstack.org -localauth
Instance ptserver, temporarily disabled, currently shutdown.
Instance vlserver, temporarily disabled, currently shutdown.

Then perform your maintenance on afsdb01. When done a reboot will automatically restart the bos service or you can manually restart the openafs-fileserver service:

root@afsdb01:~# service openafs-fileserver start

Finally check that the service is back up and running:

root@afsdb01:~# bos status afsdb01.openstack.org -localauth
Instance ptserver, currently running normally.
Instance vlserver, currently running normally.

Now you can repeat the process against afsdb02.

afs0X.openstack.org

Taking down the actual fileservers is slightly more complicated but works similarly. Basically what we need to do is make sure that either no one needs the RW volumes hosted by a fileserver before taking it down or move the RW volume to another fileserver.

To ensure nothing needs the RW volumes you can hold the various file locks on hosts that publish to AFS and/or remove cron entries that perform vos releases or volume writes.

If instead you need to move the RW volume first step is checking where the volumes live:

root@afsdb01:~# vos listvldb -localauth
VLDB entries for all servers

mirror
    RWrite: 536870934     ROnly: 536870935
    number of sites -> 3
       server afs01.dfw.openstack.org partition /vicepa RW Site
       server afs01.dfw.openstack.org partition /vicepa RO Site
       server afs01.ord.openstack.org partition /vicepa RO Site

We see that if we want to allow write to the mirror volume and take down afs01.dfw.openstack.org we will have to move the volume to one of the other servers:

root@afsdb01:~# screen # use screen as this may take quite some time.
root@afsdb01:~# vos move -id mirror -toserver afs01.ord.openstack.org -topartition vicepa -fromserver afs01.dfw.openstack.org -frompartition vicepa -localauth

When that is done (use listvldb command above to check) it is now safe to take down afs01.dfw.openstack.org while having writers to the mirror volume. We use the same process as for the db server:

root@afsdb01:~# bos shutdown afs01.dfw.openstack.org -localauth
root@afsdb01:~# bos status afsdb01.dfw.openstack.org -localauth
Auxiliary status is: file server shut down.

Perform maintenance, then restart as above and check the status again:

root@afsdb01:~# bos status afsdb01.dfw.openstack.org -localauth
Auxiliary status is: file server running.