Steven Webster 5d1a26b89d Implement CNI cache file cleanup for stale files
It has been observed in systems running for months -> years
that the CNI cache files (representing attributes of
network attachment definitions of pods) can accumulate in
large numbers in the /var/lib/cni/results/ and
/var/lib/cni/multus/ directories.

The cache files in /var/lib/cni/results/ have a naming signature of:

<type>-<pod id>-<interface name>

While the cache files in /var/lib/cni/multus have a naming signature
of:

<pod id>

Normally these files are cleaned up automatically (I believe
this is the responsibility of containerd).  It has been seen
that this happens reliably when one manually deletes a pod.

The issue has been reproduced in the case of a host being manually
rebooted.  In this case, the pods are re-created when the host comes
back up, but with a different pod-id than was used before

In this case, _most_ of the time the cache files from the previous
instantiation of the pod are deleted, but occasionally a few are
missed by the internal garbage collection mechanism.

Once a cache file from the previous instantiation of a pod escapes
garbage collection, it seems to be left as a stale file for all
subsequent reboots.  Over time, this can cause these stale files
to accumulate and take up disk space unnecessarily.

The script will be called once by the k8s-pod-recovery service
on system startup, and then periodically via a cron job installed
by puppet.

The cleanup mechanism analyzes the cache files by name and
compares them with the id(s) of the currently running pods. Any
stale files detected are deleted.

Test Plan:

PASS: Verify existing pods do not have their cache files removed
PASS: Verify files younger than the specified 'olderthan' time
      are not removed
PASS: Verify stale cache files for pods that do not exist anymore
      are removed.
PASS: Verify the script does not run if kubelet is not up yet.

Failure Path:

PASS: Verify files not matching the naming signature (pod id
      embedded in file name) are not processed

Regression:

PASS: Verify system install
PASS: Verify feature logging

Partial-Bug: 1947386

Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I0ce06646001e52d1cc6d204b924f41d049264b4c
2021-11-01 10:39:39 -04:00
..