Operation

Nodepool has two components which run as daemons. The nodepool-builder daemon is responsible for building diskimages and uploading them to providers, and the nodepool-launcher daemon is responsible for launching and deleting nodes.

Both daemons frequently re-read their configuration file after starting to support adding or removing new images and providers, or otherwise altering the configuration.

These daemons communicate with each other via a ZooKeeper database. You must run ZooKeeper and at least one of each of these daemons to have a functioning Nodepool installation.
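
As an illustration, both daemons read their ZooKeeper connection settings from the shared configuration file. The following is a minimal sketch; the hostname is a placeholder and the chroot setting is optional:

# Minimal sketch of the shared ZooKeeper connection settings in nodepool.yaml;
# the hostname is a placeholder and chroot is optional.
zookeeper-servers:
  - host: zk01.example.com
    port: 2181
    chroot: /nodepool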

Nodepool-builder

The nodepool-builder daemon builds and uploads images to providers. It may be run on the same or a separate host as the main nodepool daemon. Multiple instances of nodepool-builder may be run on the same or separate hosts in order to speed up image builds across many machines, or supply high-availability or redundancy. However, since nodepool-builder allows specification of the number of both build and upload threads, it is usually not advantageous to run more than a single instance on one machine. Note that while diskimage-builder (which is responsible for building the underlying images) generally supports executing multiple builds on a single machine simultaneously, some of the elements it uses may not. To be safe, it is recommended to run a single instance of nodepool-builder on a machine, and configure that instance to run only a single build thread (the default).
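
For example, a single builder instance can keep one build running at a time while still parallelizing uploads. The option names below are assumptions to verify against nodepool-builder --help for your version:

# Keep a single build worker (the default noted above) while allowing several
# concurrent uploads; verify the option names with "nodepool-builder --help".
nodepool-builder -c /etc/nodepool/nodepool.yaml --build-workers 1 --upload-workers 4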

Nodepool-launcher

The main nodepool daemon is named nodepool-launcher and is responsible for managing cloud instances launched from the images created and uploaded by nodepool-builder.

When a new image is created and uploaded, nodepool-launcher will immediately start using it when launching nodes (Nodepool always uses the most recent image for a given provider in the ready state). Nodepool will delete images if they are not the most recent or second most recent ready images. In other words, Nodepool will always make sure that in addition to the current image, it keeps the previous image around. This way if you find that a newly created image is problematic, you may simply delete it and Nodepool will revert to using the previous image.

Daemon usage

To start the main Nodepool daemon, run nodepool-launcher:

nodepool-launcher --help

To start the nodepool-builder daemon, run nodepool-builder:

nodepool-builder --help

To stop a daemon, send SIGINT to the process.
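
A minimal sketch of starting and stopping the daemons, assuming a configuration file at /etc/nodepool/nodepool.yaml and using $LAUNCHER_PID as a placeholder for the daemon's process ID (the -c option selects the configuration file; see --help for the full option list):

# Start both daemons against an assumed configuration file path.
nodepool-launcher -c /etc/nodepool/nodepool.yaml
nodepool-builder -c /etc/nodepool/nodepool.yaml

# Stop a daemon by sending SIGINT to its process.
kill -INT "$LAUNCHER_PID"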

When yappi (Yet Another Python Profiler) is available, additional functions' and threads' stats are emitted as well. The first SIGUSR2 enables yappi; the second SIGUSR2 dumps the information collected, resets all yappi state, and stops profiling. This is to minimize the impact of yappi on a running system.
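
For example, to profile a running launcher (using $LAUNCHER_PID as a placeholder for its process ID; the same applies to the builder):

# The first SIGUSR2 enables yappi profiling...
kill -USR2 "$LAUNCHER_PID"
# ...a second SIGUSR2 dumps the collected stats, resets yappi state, and stops profiling.
kill -USR2 "$LAUNCHER_PID"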

Metadata

When Nodepool creates instances, it will assign the following nova metadata:

groups

A comma separated list containing the name of the image and the name of the provider. This may be used by the Ansible OpenStack inventory plugin.

nodepool_image_name

The name of the image as a string.

nodepool_provider_name

The name of the provider as a string.

nodepool_node_id

The nodepool id of the node as an integer.
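
As an illustration, the metadata on a launched instance might look like the following; all values are hypothetical:

# Hypothetical instance metadata as set by Nodepool (values are examples only).
groups: debian-bullseye,providera
nodepool_image_name: debian-bullseye
nodepool_provider_name: providera
nodepool_node_id: 123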

Common Management Tasks

In the course of running a Nodepool service, you will find that there are some common operations to perform. Like the services themselves, these are split into two groups: image management and instance management.

Image Management

Before Nodepool can launch any cloud instances, it must have images to boot from. nodepool dib-image-list will show you which images are available locally on disk. These on-disk images are then uploaded to clouds; nodepool image-list will show you which images are bootable in your various clouds.

If you need to force a new image to be built to pick up a new feature more quickly than the normal rebuild cycle (which defaults to 24 hours), you can manually trigger a rebuild. Using nodepool image-build, you can tell Nodepool to begin a new image build now. Note that, depending on work the nodepool-builder is already performing, this may queue the build. Check nodepool dib-image-list to see the current state of the builds. Once the image is built, it is automatically uploaded to all of the clouds configured to use that image.
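
A quick sketch of forcing a rebuild, assuming an image named fedora (the name is hypothetical):

# Ask the builder to start a new build of the "fedora" image now.
nodepool image-build fedora

# Watch the build progress; once the new build reaches the "ready" state it is
# uploaded to every cloud configured to use it.
nodepool dib-image-list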

At times you may need to stop using an existing image because it is broken. Your two major options here are to build a new image to replace the existing image, or to delete the existing image and have Nodepool fall back to using the previous image. Rebuilding and uploading can be slow, so typically the best option is to simply nodepool image-delete the most recent image, which will cause Nodepool to fall back to using the previous image. However, if you do this without "pausing" the image, it will be immediately re-uploaded. You will want to pause the image if you need to further investigate why the image is not being built correctly. If you know the image will be built correctly, you can simply delete the built image and remove it from all clouds, which will cause it to be rebuilt, using nodepool dib-image-delete.
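
A hedged sketch of the fallback workflow described above; the image and provider names and IDs are hypothetical, and the exact argument forms vary by version, so check each subcommand's --help first:

# If you want to investigate before a replacement is uploaded, pause the image
# first in the builder configuration (diskimages: - name: fedora / pause: true).

# Delete only the most recent upload so Nodepool falls back to the previous image
# (argument names are illustrative; see "nodepool image-delete --help").
nodepool image-delete --provider providera --image fedora \
    --build-id 0000000002 --upload-id 0000000001

# Or discard the built image entirely so it is removed from all clouds and rebuilt
# (the build ID is illustrative; see "nodepool dib-image-delete --help").
nodepool dib-image-delete fedora-0000000002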

Command Line Tools

Usage

The general options that apply to all subcommands are:

nodepool --help

The following subcommands deal with nodepool images:

dib-image-list

nodepool dib-image-list --help

image-status

nodepool image-status --help

image-list

nodepool image-list --help

image-build

nodepool image-build --help

dib-image-delete

nodepool dib-image-delete --help

image-delete

nodepool image-delete --help

The following subcommands deal with nodepool nodes:

list

nodepool list --help

delete

nodepool delete --help

hold

nodepool hold --help

The following subcommands deal with ZooKeeper data management:

info

nodepool info --help

erase

nodepool erase --help

If Nodepool's database gets out of sync with reality, the following commands can help identify compute instances or images that are unknown to Nodepool:

alien-image-list

nodepool alien-image-list --help

Image builds and uploads can take a lot of time, so there is a pair of commands to export and import the image build and upload metadata from Nodepool's internal storage in ZooKeeper. These can be used to back up and restore data in case the ZooKeeper cluster is lost. Note that these commands do not save or restore the actual image data, only the records in ZooKeeper. If the data are important, consider backing them up as well. Even without the local image builds, restoring the image metadata will allow nodepool-launcher to continue to operate while new builds are created.

These commands do not export or import any node information. It is expected that any existing nodes will be detected as leaked and automatically deleted if the ZooKeeper storage is reset.

export-image-data

nodepool export-image-data --help

import-image-data

nodepool import-image-data --help
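
A minimal sketch of a backup and restore cycle, assuming both commands accept a file path argument (confirm with the --help output above); the path is a placeholder:

# Save the image build and upload records from ZooKeeper to a local file.
nodepool export-image-data /var/backups/nodepool-image-data.json

# Later, after the ZooKeeper cluster has been rebuilt, restore the records.
nodepool import-image-data /var/backups/nodepool-image-data.json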

Removing a Provider

Removing a provider from nodepool involves two separate steps: removing from the builder process, and removing from the launcher process.

Warning

Since the launcher process depends on images being present in the provider, you should follow the process for removing a provider from the launcher before doing the steps to remove it from the builder.

Removing from the Launcher

To remove a provider from the launcher, set that provider's max-servers value to 0 (or any value less than 0). This disables the provider and instructs the launcher to stop booting new nodes on it. You can then let the nodes go through their normal lifecycle. Once all nodes have been deleted, you may remove the provider from the launcher configuration file entirely, although leaving it in this state is effectively the same and makes it easy to turn the provider back on.

Note

There is currently no way to force the launcher to immediately begin deleting any unused instances from a disabled provider. If urgency is required, you can delete the nodes directly instead of waiting for them to go through their normal lifecycle, but the effect is the same.
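
If you do need to clear out a disabled provider immediately, a rough sketch (the node ID is hypothetical):

# Find the nodes still allocated in the disabled provider.
nodepool list

# Delete them directly instead of waiting for their normal lifecycle.
nodepool delete 0000000123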

For example, if you want to remove ProviderA from a launcher with a configuration file defined as:

providers:
  - name: ProviderA
    region-name: region1
    cloud: ProviderA
    boot-timeout: 120
    diskimages:
      - name: centos
      - name: fedora
    pools:
      - name: main
        max-servers: 100
        labels:
          - name: centos
            min-ram: 8192
            flavor-name: Performance
            diskimage: centos
            key-name: root-key

Then you would need to alter the configuration to:

providers:
  - name: ProviderA
    region-name: region1
    cloud: ProviderA
    boot-timeout: 120
    diskimages:
      - name: centos
      - name: fedora
    pools:
      - name: main
        max-servers: 0
        labels:
          - name: centos
            min-ram: 8192
            flavor-name: Performance
            diskimage: centos
            key-name: root-key

Note

The launcher process will automatically notice any changes in its configuration file, so there is no need to restart the service to pick up the change.

Removing from the Builder

The builder controls image building, uploading, and on-disk cleanup. The builder needs a chance to properly manage these resources for a removed provider. To do this, you need to first set the diskimages configuration section for the provider you want to remove to an empty list.

Warning

Make sure the provider is disabled in the launcher before disabling it in the builder.

For example, if you want to remove ProviderA from a builder with a configuration file defined as:

providers:
  - name: ProviderA
    region-name: region1
    diskimages:
      - name: centos
      - name: fedora

diskimages:
  - name: centos
    pause: false
    elements:
      - centos-minimal
      ...
    env-vars:
      ...

Then you would need to alter the configuration to:

providers:
  - name: ProviderA
    region-name: region1
    diskimages: []

diskimages:
  - name: centos
    pause: false
    elements:
      - centos-minimal
      ...
    env-vars:
      ...

By keeping the provider defined in the configuration file, but changing the diskimages to an empty list, you signal the builder to clean up resources for that provider, including any images already uploaded, any on-disk images, and any image data stored in ZooKeeper. After those resources have been cleaned up, it is safe to remove the provider from the configuration file entirely, if you wish to do so.

Note

The builder process will automatically notice any changes in its configuration file, so there is no need to restart the service to pick up the change.

Web interface

If configured (see webapp), a nodepool-launcher instance can provide a range of endpoints that return information in text and JSON format. Note that if there are multiple launchers, all will provide the same information.

The status of uploaded images

Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

The status of images built by diskimage-builder

Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

The paused and manual build status of images

Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

The status of currently active nodes

Query parameter "node_id": restrict the output to a specific node.
Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

Outstanding requests

Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

All available labels as reported by all launchers

Query parameter "fields": comma-separated list of fields to display.
Request header "Accept": application/json or text/*.
Response header "Content-Type": application/json or text/plain, depending on the Accept header.

Responds with status code 200 as soon as all configured providers are fully started. During startup it returns 500. This can be used as a readiness probe in a Kubernetes-based deployment.
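
A hedged example of querying a launcher's web interface, assuming the webapp listens on its default port (8005) and exposes an image-list endpoint; the host, path, and field names are illustrative, so adjust them to match your deployment:

# Request the uploaded-image status as JSON, limited to a few illustrative fields.
curl -H 'Accept: application/json' \
    'http://launcher.example.com:8005/image-list?fields=id,image,provider,state'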

Monitoring

Nodepool provides monitoring information to statsd. See the statsd configuration documentation to learn how to enable statsd support. Currently, these metrics are supported:
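
As a sketch, statsd support is commonly enabled through environment variables in the daemon's environment; the variable names and values below are assumptions to verify against the statsd configuration documentation for your version:

# Assumed environment-variable based statsd configuration; verify the variable
# names against the statsd configuration documentation before relying on them.
export STATSD_HOST=graphite.example.com
export STATSD_PORT=8125
nodepool-launcher -c /etc/nodepool/nodepool.yaml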

Nodepool builder

The following metrics are produced by a nodepool-builder process:

Nodepool launcher

The following metrics are produced by a nodepool-launcher process:

Provider Metrics

This hierarchy supplies driver-dependent information about leaked resource cleanup. Non-zero values indicate an error situation, as resources should be cleaned up automatically.

Launch metrics

OpenStack API metrics

Low level details on the timing of OpenStack API calls will be logged by openstacksdk. These calls are logged under nodepool.task.<provider>.<api-call>. The API call name is of the generic format <service-type>.<method>.<operation>. For example, the GET /servers call to the compute service becomes compute.GET.servers.

Since these calls reflect the internal operations of the openstacksdk, the exact keys logged may vary across providers and releases.

Internal metrics

The following metrics are low-level performance metrics of the launcher itself, primarily of interest to Nodepool developers, and are subject to change in the future as development needs change: