integ/kubernetes/plugins/isolcpus-device-plugin/files/intel/intel-device-plugins-for-kubernetes
Kaustubh Dhokte 4f9a8b85c2 isolcpu_plugin: wait for kubelet.sock to be ready
This change fixes two issues with the Isolated CPUs plugin.
1. The Isolated CPUs plugin systemd service does not start on the
   first attempt following kubelet start.
2. Kubelet has intermittent communication failures with
   isolcpus_plugin and hence reports 0 allocatable isolated CPU
   devices.

The plugin communicates with the kubelet through the RPC server at
/var/lib/kubelet/device-plugins/kubelet.sock, whereas the kubelet
communicates with the plugin through the socket file
/var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock.
As per the Kubernetes guidelines, the plugin watches for removal or
renaming of the file windriver.com-isolcpus.sock and restarts itself
in such an event.
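
The watch described above could be sketched roughly as follows. This
is an illustration only, not the plugin's actual code: it polls with
the Go standard library instead of using a file-notification
library, and the names socketChanged, watchSocket and demoWatch are
hypothetical. The demo uses a regular file as a stand-in for the
plugin socket.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// socketChanged reports whether path can no longer be stat'ed (for
// example because it was removed or renamed away) or now refers to a
// different file than orig.
func socketChanged(path string, orig os.FileInfo) bool {
	cur, err := os.Stat(path)
	if err != nil {
		return true
	}
	return !os.SameFile(orig, cur)
}

// watchSocket polls path until it changes or timeout elapses. It
// returns true when a change was seen, which would be the cue for the
// plugin to restart itself.
func watchSocket(path string, interval, timeout time.Duration) (bool, error) {
	orig, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if socketChanged(path, orig) {
			return true, nil
		}
		time.Sleep(interval)
	}
	return false, nil
}

// demoWatch creates a stand-in file (assumption: a regular file, not a
// real socket) and removes it after a short delay, as the kubelet's
// directory wipe would.
func demoWatch() (bool, error) {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "windriver.com-isolcpus.sock")
	if err := os.WriteFile(sock, nil, 0o600); err != nil {
		return false, err
	}
	go func() {
		time.Sleep(100 * time.Millisecond)
		os.Remove(sock)
	}()
	return watchSocket(sock, 20*time.Millisecond, 2*time.Second)
}

func main() {
	changed, err := demoWatch()
	fmt.Println("socket changed:", changed, "err:", err)
}
```

Comparing with os.SameFile also catches the case where the file is
removed and recreated between two polls, which a plain existence
check would miss.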

The following events take place in the kubelet and the plugin after
they are started:
Plugin:
1. Create socket file windriver.com-isolcpus.sock.
2. Start serving on the socket file.
3. Register itself with the kubelet.
4. Start a watch on the socket file.

Kubelet: (events related to device plugin manager only)
1. Start the device plugin registration server and wipe out
   /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
   and /var/lib/kubelet/device-plugins/kubelet.sock.
2. Create kubelet.sock and start serving on it.
3. Register a plugin upon registration request.
4. Request device information from the plugin.

In a production environment, kubelet startup time varies, so the
plugin and kubelet events above can interleave in any order.

Plugin event 3 happening before kubelet event 2 causes the plugin
to fail, because kubelet.sock is not yet being served when the
plugin tries to register. This is the root cause of the 1st issue
mentioned above.

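The sequence plugin events 1 and 2 -> kubelet events 1 and 2 ->
plugin event 3 -> kubelet events 3 and 4 causes the kubelet to not
find the file windriver.com-isolcpus.sock, because kubelet event 1
wiped it out after the plugin had created it. This is the root
cause of the 2nd issue mentioned above.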

This change makes the isolcpu_plugin wait for kubelet.sock to be
ready before it proceeds. This ensures that the kubelet has
completed the plugin directory wipe and is serving on kubelet.sock,
which fixes both of the issues mentioned above.
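
A minimal sketch of such a wait, assuming it is implemented as a
connection test against the socket, is shown below. It is not the
code from this change: the names waitForSocket and demoWait, the
timeout and the polling interval are all hypothetical, and the demo
listens on a temporary socket in place of the real kubelet.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// waitForSocket polls until a Unix socket at path accepts a
// connection, or returns an error once timeout has elapsed.
func waitForSocket(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("unix", path, interval)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("socket %s not ready after %v: %w", path, timeout, err)
		}
		time.Sleep(interval)
	}
}

// demoWait stands in for a slow kubelet start: the socket is created
// only after a delay, while the caller is already waiting for it.
func demoWait() error {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "kubelet.sock")

	lnCh := make(chan net.Listener, 1)
	go func() {
		time.Sleep(200 * time.Millisecond)
		ln, err := net.Listen("unix", sock)
		if err != nil {
			lnCh <- nil
			return
		}
		lnCh <- ln
	}()

	err = waitForSocket(sock, 5*time.Second, 50*time.Millisecond)
	if ln := <-lnCh; ln != nil {
		ln.Close()
	}
	return err
}

func main() {
	if err := demoWait(); err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("connection test success")
}
```

Dialing the socket, rather than only checking that the file exists,
also covers the window in which a stale kubelet.sock from a previous
run is still present but nothing is serving on it.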

Test Plan:
On AIO-SX:
Pre-requisite: The label kube-cpu-mgr-policy=static is assigned to
               the host with some CPUs reserved as
               application-isolated.
PASS: Restart kubelet and check windriver.com/isolcpus device
      capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Restart the isolated CPUs plugin and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Reboot controller and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Controller lock/unlock and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Remove file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
      and check windriver.com/isolcpus device capacity and
      allocatables are updated correctly.
      (kubectl describe node)
PASS: Rename file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
      and check windriver.com/isolcpus device capacity and
      allocatables are updated correctly.
      (kubectl describe node)

Note: Kubelet was patched to add a debug log after kubelet event 2
      above. In all of the above test cases, that log appeared
      before the 'connection test success' info log added by this
      change. The issue is hard to reproduce without patching the
      kubelet and the plugin binary, so the fix could not be
      verified against the original failure; it was instead
      confirmed through the order of the log events.

Closes-Bug: 2064777

Change-Id: I9645af7609cab8703fe22e05125fbf2fcfb2d20c
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
2024-05-06 19:06:43 +00:00