swift/doc/source/ops_runbook/maintenance.rst
Tim Burke 52a4fe37aa Various doc formatting cleanups
* Get rid of a bunch of accidental blockquote formatting
* Always declare a lexer to use for ``.. code::`` blocks

Change-Id: I8940e75b094843e542e815dde6b6be4740751813
2022-08-02 14:28:36 -07:00

16 KiB

Server maintenance

General assumptions

  • It is assumed that anyone attempting to replace hardware components will have already read and understood the appropriate maintenance and service guides.
  • It is assumed that where servers need to be taken off-line for hardware replacement, that this will be done in series, bringing the server back on-line before taking the next off-line.
  • It is assumed that the operations directed procedure will be used for identifying hardware for replacement.

Assessing the health of swift

You can run the swift-recon tool on a Swift proxy node to get a quick check of how Swift is doing. Please note that the numbers below are necessarily somewhat subjective. Sometimes parameters for which we say 'low values are good' will have pretty high values for a time. Often if you wait a while things get better.

For example:

$ sudo swift-recon -rla
===============================================================================
[2012-03-10 12:57:21] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 1, avg: 0, total: 1
===============================================================================

[2012-03-10 12:57:22] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
===============================================================================

[2012-03-10 12:57:22] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
[15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
[1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
===============================================================================

In the example above we ask for information on replication times (-r), load averages (-l) and async pendings (-a). This is a healthy Swift system. Rules-of-thumb for 'good' recon output are:

  • Nodes that respond are up and running Swift. If all nodes respond, that is a good sign. But some nodes may time out. For example:

    -> [http://<redacted>.29:6200/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
    -> [http://<redacted>.31:6200/recon/load:] <urlopen error timed out>
  • That could be okay or could require investigation.

  • Low values (say < 10 for high and average) for async pendings are good. Higher values occur when disks are down and/or when the system is heavily loaded. Many simultaneous PUTs to the same container can drive async pendings up. This may be normal, and may resolve itself after a while. If it persists, one way to track down the problem is to find a node with high async pendings (with swift-recon -av | sort -n -k4), then check its Swift logs, Often async pendings are high because a node cannot write to a container on another node. Often this is because the node or disk is offline or bad. This may be okay if we know about it.

  • Low values for replication times are good. These values rise when new rings are pushed, and when nodes and devices are brought back on line.

  • Our 'high' load average values are typically in the 9-15 range. If they are a lot bigger it is worth having a look at the systems pushing the average up. Run swift-recon -av to get the individual averages. To sort the entries with the highest at the end, run swift-recon -av | sort -n -k4.

For comparison here is the recon output for the same system above when two entire racks of Swift are down:

[2012-03-10 16:56:33] Checking async pendings on 384 hosts...
-> http://<redacted>.22:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/async: <urlopen error timed out>
.........
-> http://<redacted>.5:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.15:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/async: <urlopen error timed out>
Async stats: low: 243, high: 659, avg: 413, total: 132275
===============================================================================
[2012-03-10 16:57:48] Checking replication times on 384 hosts...
-> http://<redacted>.22:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/replication: <urlopen error timed out>
............
-> http://<redacted>.5:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.15:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/replication: <urlopen error timed out>
[Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.285
9475361
===============================================================================
[2012-03-10 16:59:03] Checking load avg's on 384 hosts...
-> http://<redacted>.22:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/load: <urlopen error timed out>
............
-> http://<redacted>.15:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/load: <urlopen error timed out>
[5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
[15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
[1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
===============================================================================

Note

The replication times and load averages are within reasonable parameters, even with 80 object stores down. Async pendings, however is quite high. This is due to the fact that the containers on the servers which are down cannot be updated. When those servers come back up, async pendings should drop. If async pendings were at this level without an explanation, we have a problem.

Recon examples

Here is an example of noting and tracking down a problem with recon.

Running reccon shows some async pendings:

$ ssh -q <redacted>.132.7 sudo swift-recon -alr
===============================================================================
[2012-03-14 17:25:55] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 23, avg: 8, total: 3356
===============================================================================
[2012-03-14 17:25:55] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
===============================================================================
[2012-03-14 17:25:56] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
[15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
[1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
 ===============================================================================

Why? Running recon again with -av swift (not shown here) tells us that the node with the highest (23) is <redacted>.72.61. Looking at the log files on <redacted>.72.61 we see:

$ sudo tail -f /var/log/swift/background.log | - grep -i ERROR
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}

That is why this node has a lot of async pendings: a bunch of disks that are not mounted on <redacted> and <redacted>. There may be other issues, but clearing this up will likely drop the async pendings a fair bit, as other nodes will be having the same problem.

Assessing the availability risk when multiple storage servers are down

Note

This procedure will tell you if you have a problem, however, in practice you will find that you will not use this procedure frequently.

If three storage nodes (or, more precisely, three disks on three different storage nodes) are down, there is a small but nonzero probability that user objects, containers, or accounts will not be available.

Procedure

Note

swift has three rings: one each for objects, containers and accounts. This procedure should be run three times, each time specifying the appropriate *.builder file.

  1. Determine whether all three nodes are in different Swift zones by running the ring builder on a proxy node to determine which zones the storage nodes are in. For example:

    % sudo swift-ring-builder /etc/swift/object.builder
    /etc/swift/object.builder, build version 1467
    2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
    The minimum number of hours before a partition can be reassigned is 24
    Devices:    id  zone     ip address    port     name  weight  partitions balance meta
                 0     1     <redacted>.4  6200     disk0 1708.00       4259   -0.00
                 1     1     <redacted>.4  6200     disk1 1708.00       4260    0.02
                 2     1     <redacted>.4  6200     disk2 1952.00       4868    0.01
                 3     1     <redacted>.4  6200     disk3 1952.00       4868    0.01
                 4     1     <redacted>.4  6200     disk4 1952.00       4867   -0.01
  2. Here, node <redacted>.4 is in zone 1. If two or more of the three nodes under consideration are in the same Swift zone, they do not have any ring partitions in common; there is little/no data availability risk if all three nodes are down.

  3. If the nodes are in three distinct Swift zones it is necessary to whether the nodes have ring partitions in common. Run swift-ring builder again, this time with the list_parts option and specify the nodes under consideration. For example:

    % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
    Partition   Matches
    91           2
    729          2
    3754         2
    3769         2
    3947         2
    5818         2
    7918         2
    8733         2
    9509         2
    10233        2
  4. The list_parts option to the ring builder indicates how many ring partitions the nodes have in common. If, as in this case, the first entry in the list has a 'Matches' column of 2 or less, there is no data availability risk if all three nodes are down.

  5. If the 'Matches' column has entries equal to 3, there is some data availability risk if all three nodes are down. The risk is generally small, and is proportional to the number of entries that have a 3 in the Matches column. For example:

    Partition   Matches
    26865          3
    362367         3
    745940         3
    778715         3
    797559         3
    820295         3
    822118         3
    839603         3
    852332         3
    855965         3
    858016         3
  6. A quick way to count the number of rows with 3 matches is:

    % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l
    
    30
  7. In this case the nodes have 30 out of a total of 2097152 partitions in common; about 0.001%. In this case the risk is small/nonzero. Recall that a partition is simply a portion of the ring mapping space, not actual data. So having partitions in common is a necessary but not sufficient condition for data unavailability.

    Note

    We should not bring down a node for repair if it shows Matches entries of 3 with other nodes that are also down.

    If three nodes that have 3 partitions in common are all down, there is a nonzero probability that data are unavailable and we should work to bring some or all of the nodes up ASAP.

Swift startup/shutdown

  • Use reload - not stop/start/restart.
  • Try to roll sets of servers (especially proxy) in groups of less than 20% of your servers.