Cleanup of Swift Ops Runbook

This patch cleans up some rough edges that were left (due to time constraints)
in the original commit.

Change-Id: Id4480be8dc1b5c920c19988cb89ca8b60ace91b4
Co-Authored-By: Gerry Drudy <gerry.drudy@hpe.com>

parent 643dbce134
commit e38b53393f
@@ -234,9 +234,11 @@ using the format `regex_pattern_X = regex_expression`, where `X` is a number.

This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are
using a different distro or OS, some care should be taken before using in production.

--------------
Cluster Health
--------------

.. _dispersion_report:

-----------------
Dispersion Report
-----------------

There is a swift-dispersion-report tool for measuring overall cluster health.
This is accomplished by checking if a set of deliberately distributed
@@ -2,15 +2,53 @@

Identifying issues and resolutions
==================================

Is the system up?
-----------------

If you have a report that Swift is down, perform the following basic checks:

#. Run swift functional tests.

#. From a server in your data center, use ``curl`` to check ``/healthcheck``
   (see below).

#. If you have a monitoring system, check your monitoring system.

#. Check your hardware load balancer infrastructure.

#. Run swift-recon on a proxy node (see the sketch below).
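
For the last check, a minimal sketch of a quick cluster-wide pass from a
proxy node (the same ``swift-recon`` flags are used later in this runbook):

.. code::

   # Check async pendings, load averages and replication times in one pass.
   $ sudo swift-recon -alr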

Functional tests usage
----------------------

We would recommend that you set up the functional tests to run against your
production system. Run regularly, they can be a useful tool to validate
that the system is configured correctly. In addition, they can provide
early warning about failures in your system (if the functional tests stop
working, user applications will also probably stop working).

A script for running the functional tests is located in ``swift/.functests``.
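
As a sketch (the checkout path and test configuration file below are
assumptions for this example, not part of the runbook), a scheduled run
against production might look like:

.. code::

   # Hypothetical wrapper: run the bundled functional test script from a
   # Swift source checkout, pointing it at a test.conf for your cluster.
   $ cd /opt/swift
   $ SWIFT_TEST_CONFIG_FILE=/etc/swift/test.conf ./.functests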

External monitoring
-------------------

We use pingdom.com to monitor the external Swift API. We suggest the
following:

- Do a GET on ``/healthcheck``

- Create a container, make it public (x-container-read:
  .r*,.rlistings), create a small file in the container; do a GET
  on the object (see the sketch below)
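
A minimal sketch of that probe using ``curl``; the endpoint, token and
container names are placeholders, not real values from this runbook:

.. code::

   # All hostnames, tokens and names below are illustrative placeholders.
   $ curl -s https://swift.example.com/healthcheck
   $ curl -s -X PUT -H "X-Auth-Token: $TOKEN" \
         -H "X-Container-Read: .r*,.rlistings" \
         https://swift.example.com/v1/AUTH_<project-id>/monitor
   $ echo "canary" | curl -s -X PUT -H "X-Auth-Token: $TOKEN" \
         --data-binary @- \
         https://swift.example.com/v1/AUTH_<project-id>/monitor/canary.txt
   $ curl -s https://swift.example.com/v1/AUTH_<project-id>/monitor/canary.txt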

Diagnose: General approach
--------------------------

- Look at service status in your monitoring system.

- In addition to system monitoring tools and issue logging by users,
  swift errors will often result in log entries (see :ref:`swift_logs`).

- Look at any logs your deployment tool produces.

@@ -33,22 +71,24 @@ Diagnose: Swift-dispersion-report
---------------------------------

The swift-dispersion-report is a useful tool to gauge the general
health of the system. Configure the ``swift-dispersion`` report to cover at
a minimum every disk drive in your system (usually 1% coverage).
See :ref:`dispersion_report` for details of how to configure and
use the dispersion reporting tool.

The ``swift-dispersion-report`` tool can take a long time to run, especially
if any servers are down. We suggest you run it regularly
(e.g., in a cron job) and save the results. This makes it easy to refer
to the last report without having to wait for a long-running command
to complete.
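
For example, a sketch of such a cron entry (the schedule, user and output
path are assumptions for illustration, not part of the runbook):

.. code::

   # Hypothetical /etc/cron.d entry: run the dispersion report hourly and
   # keep the latest output where operators can read it without waiting.
   0 * * * *  swift  /usr/bin/swift-dispersion-report > /var/cache/swift/last-dispersion-report 2>&1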

Diagnose: Is system responding to /healthcheck?
-----------------------------------------------

When you want to establish if a swift endpoint is running, run ``curl -k``
against https://*[ENDPOINT]*/healthcheck.
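
For example (the endpoint name is a placeholder), a healthy proxy answers
with ``OK``:

.. code::

   $ curl -k https://swift.example.com/healthcheck
   OK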

.. _swift_logs:

Diagnose: Interpreting messages in ``/var/log/swift/`` files
------------------------------------------------------------

@@ -70,25 +110,20 @@ The following table lists known issues:

     - **Signature**
     - **Issue**
     - **Steps to take**
   * - /var/log/syslog
     - kernel: [] hpsa .... .... .... has check condition: unknown type:
       Sense: 0x5, ASC: 0x20, ASC Q: 0x0 ....
     - An unsupported command was issued to the storage hardware
     - Understood to be a benign monitoring issue, ignore
   * - /var/log/syslog
     - kernel: [] sd .... [csbu:sd...] Sense Key: Medium Error
     - Suggests disk surface issues
     - Run ``swift-drive-audit`` on the target node to check for disk errors,
       repair disk errors
   * - /var/log/syslog
     - kernel: [] sd .... [csbu:sd...] Sense Key: Hardware Error
     - Suggests storage hardware issues
     - Run diagnostics on the target node to check for disk failures,
       replace failed disks
   * - /var/log/syslog
     - kernel: [] .... I/O error, dev sd.... ,sector ....
     -
     - Run diagnostics on the target node to check for disk errors
   * - /var/log/syslog
     - pound: NULL get_thr_arg
     - Multiple threads woke up
@@ -96,59 +131,61 @@ The following table lists known issues:

   * - /var/log/swift/proxy.log
     - .... ERROR .... ConnectionTimeout ....
     - A storage node is not responding in a timely fashion
     - Check if node is down, not running Swift,
       unconfigured, storage off-line or for network issues between the
       proxy and non responding node
   * - /var/log/swift/proxy.log
     - proxy-server .... HTTP/1.0 500 ....
     - A proxy server has reported an internal server error
     - Examine the logs for any errors at the time the error was reported to
       attempt to understand the cause of the error.
   * - /var/log/swift/server.log
     - .... ERROR .... ConnectionTimeout ....
     - A storage server is not responding in a timely fashion
     - Check if node is down, not running Swift,
       unconfigured, storage off-line or for network issues between the
       server and non responding node
   * - /var/log/swift/server.log
     - .... ERROR .... Remote I/O error: '/srv/node/disk....
     - A storage device is not responding as expected
     - Run ``swift-drive-audit`` and check the filesystem named in the error
       for corruption (unmount & xfs_repair). Check if the filesystem
       is mounted and working.
   * - /var/log/swift/background.log
     - object-server ERROR container update failed .... Connection refused
     - A container server node could not be contacted
     - Check if node is down, not running Swift,
       unconfigured, storage off-line or for network issues between the
       server and non responding node
   * - /var/log/swift/background.log
     - object-updater ERROR with remote .... ConnectionTimeout
     - The remote container server is busy
     - If the container is very large, some errors updating it can be
       expected. However, this error can also occur if there is a networking
       issue.
   * - /var/log/swift/background.log
     - account-reaper STDOUT: .... error: ECONNREFUSED
     - Network connectivity issue or the target server is down.
     - Resolve network issue or reboot the target server
   * - /var/log/swift/background.log
     - .... ERROR .... ConnectionTimeout
     - A storage server is not responding in a timely fashion
     - The target server may be busy. However, this error can also occur if
       there is a networking issue.
   * - /var/log/swift/background.log
     - .... ERROR syncing .... Timeout
     - A timeout occurred syncing data to another node.
     - The target server may be busy. However, this error can also occur if
       there is a networking issue.
   * - /var/log/swift/background.log
     - .... ERROR Remote drive not mounted ....
     - A storage server disk is unavailable
     - Repair and remount the file system (on the remote node)
   * - /var/log/swift/background.log
     - object-replicator .... responded as unmounted
     - A storage server disk is unavailable
     - Repair and remount the file system (on the remote node)
   * - /var/log/swift/*.log
     - STDOUT: EXCEPTION IN
     - An unexpected error occurred
     - Read the Traceback details, if it matches known issues

@@ -157,19 +194,14 @@ The following table lists known issues:

   * - /var/log/rsyncd.log
     - rsync: mkdir "/disk....failed: No such file or directory....
     - A local storage server disk is unavailable
     - Run diagnostics on the node to check for a failed or
       unmounted disk
   * - /var/log/swift*
     - Exception: Could not bind to 0.0.0.0:6xxx
     - Possible Swift process restart issue. This indicates an old swift
       process is still running.
     - Restart Swift services. If some swift services are reported down,
       check if they left a residual process behind.
   * - /var/log/rsyncd.log
     - rsync: recv_generator: failed to stat "/disk....." (in object)
       failed: Not a directory (20)
     - Swift directory structure issues
     - Run swift diagnostics on the node to check for issues
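
To scan for several of the signatures above in one pass, a sketch (the
pattern list is illustrative, not exhaustive) is:

.. code::

   # Illustrative scan of the swift logs for a few of the signatures
   # listed in the tables; extend the pattern list as needed.
   $ sudo egrep "ConnectionTimeout|Remote I/O error|responded as unmounted" \
         /var/log/swift/*.log | tail -20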

Diagnose: Parted reports the backup GPT table is corrupt
--------------------------------------------------------

@@ -188,7 +220,7 @@ Diagnose: Parted reports the backup GPT table is corrupt

   OK/Cancel?

To fix, go to :ref:`fix_broken_gpt_table`


Diagnose: Drives diagnostic reports a FS label is not acceptable

@@ -240,9 +272,10 @@ Diagnose: Failed LUNs

.. note::

   The HPE Helion Public Cloud uses direct attach SmartArray
   controllers/drives. The information here is specific to that
   environment. The hpacucli utility mentioned here may be called
   hpssacli in your environment.

The ``swift_diagnostics`` mount checks may return a warning that a LUN has
failed, typically accompanied by DriveAudit check failures and device

@@ -254,7 +287,7 @@ the procedure to replace the disk.

Otherwise the LUN can be re-enabled as follows:

#. Generate a hpssacli diagnostic report. This report allows the DC
   team to troubleshoot potential cabling or hardware issues so it is
   imperative that you run it immediately when troubleshooting a failed
   LUN. You will come back later and grep this file for more details, but
@@ -262,8 +295,7 @@ Otherwise the lun can be re-enabled as follows:

   .. code::

      sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off

Export the following variables using the below instructions before
proceeding further.

@@ -317,8 +349,7 @@ proceeding further.

   .. code::

      sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name"

#. Export the device name variable from the preceding command (example:
   /dev/sdk):

@@ -396,6 +427,8 @@ proceeding further.

   should be checked. For example, log a DC ticket to check the sas cables
   between the drive and the expander.

.. _diagnose_slow_disk_drives:

Diagnose: Slow disk devices
---------------------------

@@ -404,7 +437,8 @@ Diagnose: Slow disk devices

collectl is an open-source performance gathering/analysis tool.

If the diagnostics report a message such as ``sda: drive is slow``, you
should log onto the node and run the following command (remove the ``-c 1``
option to continuously monitor the data):

.. code::

@@ -431,13 +465,12 @@ should log onto the node and run the following comand:

   dm-3 0 0 0 0 0 0 0 0 0 0 0 0 0
   dm-4 0 0 0 0 0 0 0 0 0 0 0 0 0
   dm-5 0 0 0 0 0 0 0 0 0 0 0 0 0
   ...
   (repeats -- type Ctrl/C to stop)


Look at the ``Wait`` and ``SvcTime`` values. It is not normal for
these values to exceed 50msec. This is known to impact customer
performance (upload/download). For a controller problem, many/all drives
will show long wait and service times. A reboot may correct the problem;
otherwise hardware replacement is needed.

Another way to look at the data is as follows:
@@ -526,12 +559,12 @@ be disabled on a per-drive basis.

Diagnose: Slow network link - Measuring network performance
-----------------------------------------------------------

Network faults can cause performance between Swift nodes to degrade. Testing
with ``netperf`` is recommended. Other methods (such as copying large
files) may also work, but can produce inconclusive results.

Install ``netperf`` on all systems if not
already installed. Check that the UFW rules for its control port are in place.
However, there are no pre-opened ports for netperf's data connection. Pick a
port number. In this example, 12866 is used because it is one higher
than netperf's default control port number, 12865. If you get very

@@ -561,11 +594,11 @@ Running tests

#. On the ``source`` node, run the following command to check
   throughput. Note the double-dash before the -P option.
   The command takes 10 seconds to complete. The ``target`` node is 192.168.245.5.

   .. code::

      $ netperf -H 192.168.245.5 -- -P 12866
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to
      <redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
      Recv   Send    Send

@@ -578,7 +611,7 @@ Running tests

   .. code::

      $ netperf -H 192.168.245.5 -t TCP_RR -- -P 12866
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866
      AF_INET to <redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
      : first burst 0

@@ -763,7 +796,7 @@ Diagnose: High system latency

  used by the monitor program happen to live on the bad object server.

- A general network problem within the data center. Compare the results
  with the Pingdom monitors to see if they also have a problem.

Diagnose: Interface reports errors
----------------------------------

@@ -802,59 +835,21 @@ If the nick supports self test, this can be performed with:

Self tests should read ``PASS`` if the nic is operating correctly.

Nic module drivers can be re-initialised by carefully removing and
re-installing the modules (this avoids rebooting the server).
For example, mellanox drivers use a two part driver mlx4_en and
mlx4_core. To reload these you must carefully remove the mlx4_en
(ethernet) then the mlx4_core modules, and reinstall them in the
reverse order.

As the interface will be disabled while the modules are unloaded, you
must be very careful not to lock yourself out so it may be better
to script this.
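
A sketch of such a script, assuming the Mellanox mlx4 driver pair discussed
above (run it from the server console, or detached with ``nohup``, so a
dropped SSH session cannot interrupt it part way through):

.. code::

   # Sketch only: reload the two-part mlx4 driver in the required order.
   # The interface is down while the modules are unloaded, so do not run
   # this interactively over the interface you are reloading.
   sudo modprobe -r mlx4_en
   sudo modprobe -r mlx4_core
   sudo modprobe mlx4_core
   sudo modprobe mlx4_en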

Diagnose: Hung swift object replicator
--------------------------------------

A replicator reports in its log that remaining time exceeds
100 hours. This may indicate that the swift ``object-replicator`` is stuck and not
making progress. Another useful way to check this is with the
``swift-recon -r`` command on a swift proxy server:

@@ -866,14 +861,13 @@ making progress. Another useful way to check this is with the

   --> Starting reconnaissance on 384 hosts
   ===============================================================================
   [2013-07-17 12:56:19] Checking on replication
   [replication_time] low: 2, high: 80, avg: 28.8, total: 11037, Failed: 0.0%, no_result: 0, reported: 383
   Oldest completion was 2013-06-12 22:46:50 (12 days ago) by 192.168.245.3:6000.
   Most recent completion was 2013-07-17 12:56:19 (5 seconds ago) by 192.168.245.5:6000.
   ===============================================================================

The ``Oldest completion`` line in this example indicates that the
object-replicator on swift object server 192.168.245.3 has not completed
the replication cycle in 12 days. This replicator is stuck. The object
replicator cycle is generally less than 1 hour. Though a replicator
cycle of 15-20 hours can occur if nodes are added to the system and a

@@ -886,22 +880,22 @@ the following command:

.. code::

   # sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep"
   Jul 16 06:25:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining)
   Jul 16 06:30:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining)
   Jul 16 06:35:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining)
   Jul 16 06:40:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69918.73s (0.22/sec, 23h remaining)
   Jul 16 06:45:46 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70218.75s (0.22/sec, 24h remaining)
   Jul 16 06:50:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70518.85s (0.22/sec, 24h remaining)
   Jul 16 06:55:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70818.95s (0.22/sec, 25h remaining)
   Jul 16 07:00:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71119.05s (0.22/sec, 25h remaining)
   Jul 16 07:05:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71419.15s (0.21/sec, 26h remaining)
   Jul 16 07:10:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71719.25s (0.21/sec, 26h remaining)
   Jul 16 07:15:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72019.27s (0.21/sec, 27h remaining)
   Jul 16 07:20:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72319.37s (0.21/sec, 27h remaining)
   Jul 16 07:25:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72619.47s (0.21/sec, 28h remaining)
   Jul 16 07:30:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72919.56s (0.21/sec, 28h remaining)
   Jul 16 07:35:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 73219.67s (0.21/sec, 29h remaining)
   Jul 16 07:40:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 73519.76s (0.21/sec, 29h remaining)

The above status is output every 5 minutes to ``/var/log/swift/background.log``.
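
If the replicator is confirmed stuck, one way to clear it (mirroring the
advice in the table further below) is to restart just that service with
``swift-init``; a sketch:

.. code::

   # Sketch: restart only the object replicator on the affected node,
   # then confirm in background.log that the remaining-time estimate drops.
   $ sudo swift-init object-replicator stop
   $ sudo swift-init object-replicator start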

@@ -921,7 +915,7 @@ of a corrupted filesystem detected by the object replicator:

.. code::

   # sudo bzgrep "Remote I/O error" /var/log/swift/background.log* | grep srv | tail -1
   Jul 12 03:33:30 192.168.245.4 object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File
   "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir,
   reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents =
   sorted(os.listdir(path))#012OSError: [Errno 121] Remote I/O error: '/srv/node/disk4/objects/1643763/b51'

@@ -996,7 +990,7 @@ to repair the problem filesystem.

   # sudo xfs_repair -P /dev/sde1

#. If the ``xfs_repair`` fails then it may be necessary to re-format the
   filesystem. See :ref:`fix_broken_xfs_filesystem`. If the
   ``xfs_repair`` is successful, re-enable chef using the following command
   and replication should commence again.

@@ -1025,7 +1019,183 @@ load:

   $ uptime
   07:44:02 up 18:22,  1 user,  load average: 407.12, 406.36, 404.59

Further issues and resolutions
------------------------------

.. note::

   The urgency level in each **Action** column indicates whether or
   not it is required to take immediate action, or if the problem can be worked
   on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much so any drop in value is probably related to
       network issues, rather than the proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>:<port>/healthcheck`` where
       ``ip-address`` is the individual proxy IP address.
       Repeat this for every proxy server to see if you can pinpoint the problem.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.
   * - Host clock is not synced to an NTP server.
     - The node's time settings do not match NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with a ``rsyslogd`` restart and where ``/tmp`` was hanging.
     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object account or container files not owned by swift.
     - This typically happens if during a reinstall or a re-image of a server the UID
       of the swift user was changed. The data files in the object account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternate
       action is to change the ownership of every file on all file systems. This alternate
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and that its
       battery/capacitor is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See :ref:`diagnose_slow_disk_drives` on how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1Ge NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the speed now matches the expected speed. Failing
          that, check the hardware (NIC cable/switch port).

       2. If persistent, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the range
          3-30 probably indicate that the error count has crept up slowly over a long time.
          Consider rebooting the server to remove the report from the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+; that is,
          it stands out. There may be other symptoms such as the interface going up and down or
          not running at the correct speed. A server with a high error count should be watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.

   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low. However if you
       recently added or replaced disk drives then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater. This
       may potentially explain why the container updater is not running.
     - Urgency: Medium

       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium

       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.

sec-furtherdiagnose.rst

@@ -1,36 +0,0 @@

@@ -13,67 +13,15 @@ information, suggestions or recommendations. This document are provided

for reference only. We are not responsible for your use of any
information, suggestions or recommendations contained herein.

This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the tools themselves
are not publicly available at this time.

- ``swift-direct``: This is similar to the ``swiftly`` tool.


.. toctree::
   :maxdepth: 2

   general.rst
   diagnose.rst
   procedures.rst
   maintenance.rst
   troubleshooting.rst

@@ -54,8 +54,8 @@ system. Rules-of-thumb for 'good' recon output are:

   .. code::

      -> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
      -> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>

- That could be okay or could require investigation.

@@ -154,18 +154,18 @@ Running reccon shows some async pendings:

.. code::

   bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running recon again with -av swift (not shown here) tells us that

@@ -231,7 +231,7 @@ Procedure

This procedure should be run three times, each time specifying the
appropriate ``*.builder`` file.

#. Determine whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to determine which zones
   the storage nodes are in. For example:

@@ -241,22 +241,22 @@ Procedure

      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices: id  zone  ip address    port  name   weight   partitions  balance  meta
                0     1  <redacted>.4  6000  disk0  1708.00        4259    -0.00
                1     1  <redacted>.4  6000  disk1  1708.00        4260     0.02
                2     1  <redacted>.4  6000  disk2  1952.00        4868     0.01
                3     1  <redacted>.4  6000  disk3  1952.00        4868     0.01
                4     1  <redacted>.4  6000  disk4  1952.00        4867    -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little/no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones it is necessary to
   determine whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts`` option and specify
   the nodes under consideration. For example:

   .. code::

@@ -302,12 +302,12 @@ Procedure

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.001%. In this case the risk is small/nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability.

@@ -320,3 +320,11 @@ Procedure

   If three nodes that have 3 partitions in common are all down, there is
   a nonzero probability that data are unavailable and we should work to
   bring some or all of the nodes up ASAP.

Swift startup/shutdown
~~~~~~~~~~~~~~~~~~~~~~

- Use reload - not stop/start/restart (see the sketch below).

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
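
A sketch of such a rolling reload; the host names and the size of the group
are illustrative only:

.. code::

   # Illustrative rolling reload: reload one small group of proxies,
   # confirm each answers /healthcheck, then move on to the next group.
   for host in proxy01 proxy02 proxy03; do
       ssh $host "sudo swift-init proxy-server reload"
       curl -sf https://$host/healthcheck || echo "$host failed healthcheck"
   done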

@@ -2,6 +2,8 @@

Software configuration procedures
=================================

.. _fix_broken_gpt_table:

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -102,6 +104,8 @@ Fix broken GPT table (broken disk partition)

      $ sudo aptitude remove gdisk

.. _fix_broken_xfs_filesystem:

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -165,7 +169,7 @@ Procedure: Fix broken XFS filesystem

   .. code::

      $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s

@@ -187,129 +191,173 @@ Procedure: Fix broken XFS filesystem

      $ mount

.. _checking_if_account_ok:

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternate (or use ``swift-get-nodes`` as explained
   here).

You must know the tenant/project ID. You can check if the account is okay as follows from a proxy.

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>

The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.

Alternatively, you can use ``swift-get-nodes`` to find the account database
files. Run the following on a proxy:

.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id>

The response will print curl/ssh commands that will list the replicated
account databases. Use the indicated ``curl`` or ``ssh`` commands to check
the status and existence of the account.

Procedure: Getting swift account stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Go look at
   ``swiftly`` for an alternate or use ``swift-get-nodes`` as explained
   in :ref:`checking_if_account_ok`.

This procedure describes how you determine the swift usage for a given
swift account, that is the number of containers, number of objects and
total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use swift-direct to show this account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.

Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant/project is deleted,
the account can then be deleted. If the user wishes to use Swift again,
the normal process is to create a new tenant/project -- and hence a
new Swift account.

However, if the Swift account is deleted, but the tenant/project is not
deleted from Keystone, the user can no longer access the account. This
is because the account is marked deleted in Swift. You can revive
the account as described in this process.
|
||||
|
||||
Deleting the account database files
|
||||
-----------------------------------
|
||||
.. note::
|
||||
|
||||
Here is one possible solution. The containers and objects may be lost
|
||||
forever. The solution is to delete the account database files and
|
||||
re-create the account. This may only be done once the containers and
|
||||
objects are completely deleted. This process is untested, but could
|
||||
work as follows:
|
||||
The containers and objects in the "old" account cannot be listed
|
||||
anymore. In addition, if the Account Reaper process has not
|
||||
finished reaping the containers and objects in the "old" account, these
|
||||
are effectively orphaned and it is virtually impossible to find and delete
|
||||
them to free up disk space.
|
||||
|
||||
#. Use swift-get-nodes to locate the account's database file (on three
|
||||
servers).
|
||||
The solution is to delete the account database files and
|
||||
re-create the account as follows:
|
||||
|
||||
#. Rename the database files (on three servers).
|
||||
#. You must know the tenant/project ID. The account name is AUTH_<project-id>.
|
||||
In this example, the tenant/project is is ``4ebe3039674d4864a11fe0864ae4d905``
|
||||
so the Swift account name is ``AUTH_4ebe3039674d4864a11fe0864ae4d905``.
|
||||
|
||||
#. Use ``swiftly`` to create the account (use original name).
|
||||
|
||||
Renaming account database so it can be revived
|
||||
----------------------------------------------
|
||||
|
||||
Get the locations of the database files that hold the account data.
|
||||
#. Use ``swift-get-nodes`` to locate the account's database files (on three
|
||||
servers). The output has been truncated so we can focus on the import pieces
|
||||
of data:
|
||||
|
||||
.. code::
|
||||
|
||||
sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1
|
||||
$ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_4ebe3039674d4864a11fe0864ae4d905
|
||||
...
|
||||
curl -I -XHEAD "http://192.168.245.5:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
|
||||
curl -I -XHEAD "http://192.168.245.3:6002/disk0/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
|
||||
curl -I -XHEAD "http://192.168.245.4:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
|
||||
...
|
||||
Use your own device location of servers:
|
||||
such as "export DEVICE=/srv/node"
|
||||
ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
|
||||
ssh 192.168.245.3 "ls -lah ${DEVICE:-/srv/node*}/disk0/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
|
||||
ssh 192.168.245.4 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
|
||||
...
|
||||
note: `/srv/node*` is used as default value of `devices`, the real value is set in the config file on each storage node.
|
||||
|
||||
Account AUTH_redacted-1856-44ae-97db-31242f7ad7a1
|
||||
Container None
|
||||
|
||||
Object None
|
||||
#. Before proceeding check that the account is really deleted by using curl. Execute the
|
||||
commands printed by ``swift-get-nodes``. For example:
|
||||
|
||||
Partition 18914
|
||||
.. code::
|
||||
|
||||
Hash 93c41ef56dd69173a9524193ab813e78
|
||||
$ curl -I -XHEAD "http://192.168.245.5:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
|
||||
HTTP/1.1 404 Not Found
|
||||
Content-Length: 0
|
||||
Content-Type: text/html; charset=utf-8
|
||||
|
||||

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).
   A ``404 Not Found`` indicates that the account is deleted (or never existed).

   If you get a ``204 No Content`` response, do **not** proceed.

#. Use the ssh commands printed by ``swift-get-nodes`` to check if database
   files exist. For example:

   .. code::

      $ ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
      total 20K
      drwxr-xr-x 2 swift swift 110 Mar  9 10:22 .
      drwxr-xr-x 3 swift swift  45 Mar  9 10:18 ..
      -rw------- 1 swift swift 17K Mar  9 10:22 f5ecf8b40de3e1b0adb0dbe576874052.db
      -rw-r--r-- 1 swift swift   0 Mar  9 10:22 f5ecf8b40de3e1b0adb0dbe576874052.db.pending
      -rwxr-xr-x 1 swift swift   0 Mar  9 10:18 .lock

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).

   If no files exist, no further action is needed.

#. Stop Swift processes on all nodes listed by ``swift-get-nodes``
   (in this example, that is 192.168.245.3, 192.168.245.4 and 192.168.245.5).
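
   A minimal sketch of this step, assuming ``swift-init`` manages all Swift
   services on these storage nodes (adjust the host list to your environment):

   .. code::

      $ for node in 192.168.245.3 192.168.245.4 192.168.245.5; do
            ssh $node "sudo swift-init all shutdown"
        done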

#. We recommend you make backup copies of the database files.
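
   For example, a copy could be kept on each node before removal; a sketch,
   where the ``/var/tmp`` destination is only an illustration (use whatever
   location your site policy dictates):

   .. code::

      $ ssh 192.168.245.5 "sudo cp -a ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052 /var/tmp/"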

#. Delete the database files. For example:

   .. code::
"ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.103
|
||||
"ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]ssh 15.184.9.120
|
||||
"ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/" # [Handoff]
|
||||
$ ssh 192.168.245.5
|
||||
$ cd /srv/node/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052
|
||||
$ sudo rm *
|
||||
|
||||

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).

#. Restart Swift on all three servers.
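
   A minimal sketch, again assuming ``swift-init`` manages the services; run
   this on each of the three servers (or wrap it in ssh as in the earlier step):

   .. code::

      $ sudo swift-init all start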

At this stage, the account is fully deleted. If you enable the auto-create option, the
next time the user attempts to access the account, the account will be created.
You may also use swiftly to recreate the account.
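
The auto-create behavior is controlled by the ``account_autocreate`` option in
the proxy server configuration. A quick way to confirm how it is set (the value
shown is what you would want if you rely on auto-create):

.. code::

   $ grep account_autocreate /etc/swift/proxy-server.conf
   account_autocreate = true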

Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@ -319,7 +367,7 @@ follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in /etc/swift/proxy-server.conf the ``disable_path`` variable is set to
   ``/etc/swift/disabled-by-file``.

#. Log onto the proxy node.

@ -330,9 +378,9 @@ balancers, customers are not impacted by the misbehaving proxy.

      sudo swift-init proxy shutdown

   .. note::

      Shutdown, not stop.

#. Create the ``/etc/swift/disabled-by-file`` file. For example:

@ -346,13 +394,10 @@ balancers, customers are not impacted by the misbehaving proxy.

      sudo swift-init proxy start

It works because the healthcheck middleware looks for /etc/swift/disabled-by-file.
If it exists, the middleware returns a 503 error instead of 200/OK, which means the
load balancer should stop sending traffic to the proxy.

``/healthcheck`` will report
``FAIL: disabled by file`` if the ``disabled-by-file`` file exists.
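
A quick way to verify the effect from the proxy node itself; a sketch, where
port 8080 is only an assumption about your proxy's bind port:

.. code::

   $ sudo touch /etc/swift/disabled-by-file
   $ curl http://127.0.0.1:8080/healthcheck
   FAIL: disabled by file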

Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


@ -1,177 +0,0 @@
==============================
Further issues and resolutions
==============================

.. note::

   The urgency levels in each **Action** column indicate whether or
   not it is required to take immediate action, or if the problem can be worked
   on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much, so any drop in value is probably related to
       network issues, rather than the proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where ``<ip-address>`` is an individual
       proxy IP address, to see if you can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - Swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.

   * - Host clock is not synced to an NTP server.
     - The node's time settings do not match NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds, to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.
     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files not owned by swift.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternate
       action is to change the ownership of every file on all file systems. This alternate
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then that disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and that its
       battery/capacitor is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` on how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1GbE NIC.
     - 1. Try resetting the interface with:

           .. code::

              sudo ethtool -s eth0 speed 1000

           ... and then run:

           .. code::

              sudo lshw -class network

           See if the speed now matches the expected value. Failing
           that, check hardware (NIC cable/switch port).

        2. If persistent, consider shutting down the server (especially if a proxy)
           until the problem is identified and resolved. If you leave this server
           running it can have a large impact on overall performance.

           Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the range
           3-30 probably indicate that the error count has crept up slowly over a long time.
           Consider rebooting the server to remove the report from the noise.

           Typically, when a cable or interface is bad, the error count goes to 400+; in other words,
           it stands out. There may be other symptoms such as the interface going up and down or
           not running at the correct speed. A server with a high error count should be watched.

        2. If the error count continues to climb, consider taking the server down until
           it can be properly investigated. In any case, a reboot should be done to clear
           the error count.

           Urgency: High, if the error count is increasing.
   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low. However, if you recently added or replaced disk drives,
       you should treat this urgently.

   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater. This
       may potentially explain why the container updater is not running.
     - Urgency: Medium

       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium

       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
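
For a couple of the checks above (clock sync and open file descriptors), commands
along these lines can help; a sketch, assuming standard Linux tooling and with
``<pid>`` standing in for the PID of the swift process being examined:

.. code::

   $ ntpq -p                     # peer list; a '*' marks the selected sync source
   $ ls /proc/<pid>/fd | wc -l   # count of open file descriptors for a process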

@ -18,16 +18,14 @@ files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
   <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
   /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -

@ -37,52 +35,49 @@ This shows a ``GET`` operation on the user's account.

.. note::

   The HTTP status returned is 404, Not found, rather than 500 as reported by the user.

Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the swift object servers log files for this transaction ID:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131] \
     'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.72.16
   ----------------
   Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0016 ""
   ----------------
   <redacted>.31
   ----------------
   Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-
   4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
   ----------------
   <redacted>.204.70
   ----------------
   Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-
   4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""

.. note::

   Note the 3 GET operations to 3 different object servers that hold the 3
   replicas of this user's account. Each ``GET`` returns a HTTP status of 404,
   Not found.

Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:

.. code::

@ -114,23 +109,23 @@ user's account data is stored:

   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
   -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account db, an sqlite db, is present. Use sqlite to
@ -155,7 +150,7 @@ checkout the account:
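
One way to obtain output like the fragment shown below; a sketch, where the
``sqlite3`` invocation and the ``account_stat`` table are assumptions based on a
standard Swift account database (the path is the one found above):

.. code::

   $ sudo sqlite3 /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;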
   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``. So this account was deleted. This explains
   why the GET operations are returning 404, not found. Check the account

@ -174,14 +169,14 @@ server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139] \
     'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* \
     | grep -w DELETE | awk "{print $3,$10,$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server <redacted> <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-
   52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
   tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being deleted.

@ -252,8 +247,8 @@ Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are being
re-purposed), it is very important to follow these steps.

#. In the case of object servers, follow the procedure for removing
   the node from the rings.
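
   A hedged sketch of the usual ring-removal flow (the builder file name and the
   device spec ``d100`` are illustrative; they depend on your deployment, and
   draining gradually is preferable where possible):

   .. code::

      $ swift-ring-builder object.builder set_weight d100 0
      $ swift-ring-builder object.builder rebalance
      # later, after replication has moved data off the device:
      $ swift-ring-builder object.builder remove d100
      $ swift-ring-builder object.builder rebalance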