docs/doc/source/fault-mgmt/kubernetes/troubleshooting-log-collection.rst
Suzana Fernandes 854eb3de54 Update for collect tool enhancement to support any LDAP or WAD user account
2024-11-18 14:14:20 +00:00

Troubleshoot Log Collection

The log collection tool (collect) gathers logs and other detailed diagnostic information from the hosts in the system.

Collect Tool Caveats and Usage

  • Log in via SSH or local console on the active controller and use the collect command.

    Note

    The user must have sudo capability and be in the sys_protected group to use the collect tool.

  • All usage options can be found by using the following command:

    (keystone_admin)$ collect --help
  • For Simplex or Duplex systems, use the following command:

    (keystone_admin)$ collect --all
  • For Standard systems, use the following commands:

    • For a small deployment (fewer than two worker nodes):

      (keystone_admin)$ collect --all

      You can also use the short form -a for this option.

      Note

      Hosts or subclouds that are listed explicitly together with the --all option are ignored.

    • For large deployments:

      (keystone_admin)$ collect host1 host2 host3

      Alternatively, you can use the --list option, although this syntax is deprecated.

      (keystone_admin)$ collect --list host1 host2 host3

      You can also use the short form -l for this option.

      Note

      Systems and subclouds are collected in parallel to reduce the overall collection time. Use the --inline (or -in) option to collect serially. --inline can be combined with the --all option.

      (keystone_admin)$ collect --all [--timeout | -t] <minutes>

      Note

      For large deployments, the default timeout value (20 minutes) may need to be increased by using the --timeout (-t) option.

      The global timeout applies to collection from the local host, that is, the host that collect is run from.

      If collection from a host times out, run collect with an extended --timeout locally on the host that is experiencing the timeout, so that the global timeout applies to it.

      Optionally, you can modify the default COLLECT_HOST_TIMEOUT_DEFAULT value in the /etc/collect/collect_timeouts file. Editing the file requires sudo, and no processes need to be restarted after the change. All subsequent collects adopt the new values in that file.

    • For subcloud deployments:

      (keystone_admin)$ collect --subcloud subcloud1 subcloud2 subcloud3

      You can also use the short form -sc for this option. The --subcloud and --all options can be combined.

      (keystone_admin)$ collect --all --subcloud

      Note

      The --all (-a) option is not recommended with large subcloud deployments due to disk storage requirements.

  • For systems with an uptime of more than two months, use the date range options. The default behavior is to collect one month of logs.

    Use --start-date for the collection of logs on and after a given date:

    (keystone_admin)$ collect [--start-date | -s] <YYYYMMDD>

    Use --end-date for the collection of logs on and before a given date:

    (keystone_admin)$ collect [--end-date | -e] <YYYYMMDD>
  • To add a prefix to the collect tarball name, so that a given collect is easy to identify when several are present, use the following command:

    (keystone_admin)$ collect [--name | -n] <prefix>

    For example, the following prepends TEST1 to the name of the tarball:

    (keystone_admin)$ collect --name TEST1
    [sudo] password for sysadmin:
    collecting data from 1 host(s): controller-0
    collecting controller-0_20200316.155805 ... done (00:01:39   56M)
    creating user-named tarball /scratch/TEST1_20200316.155805.tar ... done (00:01:39   56M)
  • Before you use the collect command, the nodes must be in the unlocked-enabled or disabled-online state and must have been unlocked at least once.

  • To collect logs for a node that is rebooting indefinitely, lock the node and wait for it to reach the disabled-online state before running collect.

  • If the collect tool running from the active controller node fails to collect logs from one of the system nodes, you may need to run collect locally. Execute the collect command from the console of, or a connection to, the node that displays the failure.
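
Two of the steps above involve small amounts of bookkeeping: working out YYYYMMDD values for the date range options, and finding a prefixed tarball afterwards. The following is a minimal sketch of how that might be scripted, not part of the collect tool. The function names collect_date_args and latest_collect_tarball are hypothetical, the date arithmetic assumes GNU date, and the /scratch location and <prefix>_<timestamp>.tar naming are taken only from the example output shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions; they do not invoke collect themselves.

# Print --start-date/--end-date arguments covering the last N days.
# Assumes GNU date (the -d option). The optional second argument sets
# the reference day (YYYYMMDD); it defaults to today.
collect_date_args() {
    local days="$1"
    local ref="${2:-$(date +%Y%m%d)}"
    local start
    start="$(date -d "${ref} -${days} days" +%Y%m%d)" || return 1
    printf -- '--start-date %s --end-date %s\n' "$start" "$ref"
}

# Print the most recent tarball whose name starts with the given prefix.
# Assumes the <prefix>_<YYYYMMDD.HHMMSS>.tar naming seen in the example
# output; the directory defaults to /scratch and can be overridden.
latest_collect_tarball() {
    local prefix="$1"
    local dir="${2:-/scratch}"
    local f latest=""
    for f in "$dir/${prefix}"_*.tar; do
        [ -e "$f" ] || continue   # glob matched nothing
        # The timestamp sorts lexically, so a string comparison suffices.
        if [[ -z "$latest" || "$f" > "$latest" ]]; then
            latest="$f"
        fi
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For instance, collect_date_args 14 20240315 prints --start-date 20240301 --end-date 20240315, which could then be passed to collect, assuming both date options are accepted in the same invocation. latest_collect_tarball TEST1 prints the newest matching file under /scratch, or returns a non-zero status if none exists.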
