integ/kubernetes/plugins/isolcpus-device-plugin/files
Kaustubh Dhokte 4f9a8b85c2 isolcpu_plugin: wait for kubelet.sock to be ready
This change fixes two issues with the Isolated CPUs plugin.
1. The Isolated CPU plugin systemd service does not start on the
   first attempt following kubelet start.
2. Kubelet has intermittent communication failures with
   isolcpus_plugin, and hence reports 0 allocatable isolated CPU devices.

The plugin communicates with the kubelet through the RPC server at
/var/lib/kubelet/device-plugins/kubelet.sock, whereas the kubelet
communicates with the plugin through the socket file
/var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock.
Following the Kubernetes guidelines, the plugin watches for removal
or renaming of the file windriver.com-isolcpus.sock and restarts
itself in such an event.

The following events take place in the kubelet and the plugin after
they are started:
Plugin:
1. Create socket file windriver.com-isolcpus.sock.
2. Start serving on the socket file.
3. Register itself with the kubelet.
4. Start a watch on the socket file.

Kubelet: (events related to device plugin manager only)
1. Start the device plugin registration server and wipe out
   /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
   and /var/lib/kubelet/device-plugins/kubelet.sock.
2. Create kubelet.sock and start serving on it.
3. Register a plugin upon registration request.
4. Request device information from the plugin.

In a production environment, kubelet startup time varies, so the
plugin and kubelet events above can interleave in any order.

Plugin event 3 happening before kubelet event 2 causes the plugin
to fail and is the root cause of the 1st issue mentioned above.

The sequence plugin events 1 and 2 -> kubelet events 1 and 2 ->
plugin event 3 -> kubelet events 3 and 4 leaves the kubelet unable to
find windriver.com-isolcpus.sock, because kubelet event 1 wiped the
file after the plugin created it. This causes the 2nd issue mentioned
above.
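
The two failure orderings can be expressed as a toy model. This is only an illustration of the reasoning above, not code from the plugin or the kubelet. The labels P1..P4 and K1..K4 are shorthand for the numbered plugin and kubelet events, and the function classify is hypothetical. It assumes each sequence contains every event exactly once.

```go
package main

import "fmt"

// indexOf returns the position of event e in seq, or -1 if absent.
func indexOf(seq []string, e string) int {
	for i, s := range seq {
		if s == e {
			return i
		}
	}
	return -1
}

// classify applies the two rules described in the commit message:
//   - P3 before K2: the plugin registers before kubelet.sock is served.
//   - otherwise, P1 before K1: the kubelet wipes the plugin socket
//     after the plugin has created it.
func classify(seq []string) string {
	p1, p3 := indexOf(seq, "P1"), indexOf(seq, "P3")
	k1, k2 := indexOf(seq, "K1"), indexOf(seq, "K2")
	switch {
	case p3 < k2:
		return "issue 1"
	case p1 < k1:
		return "issue 2"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(classify([]string{"P1", "P2", "P3", "K1", "K2", "K3", "K4", "P4"}))
	fmt.Println(classify([]string{"P1", "P2", "K1", "K2", "P3", "K3", "K4", "P4"}))
	fmt.Println(classify([]string{"K1", "K2", "P1", "P2", "P3", "K3", "K4", "P4"}))
}
```

The third ordering, where the kubelet is already serving on kubelet.sock before the plugin creates its socket, is the one the model treats as safe.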

This change makes the isolcpu_plugin wait for kubelet.sock to be
ready. Once the kubelet is serving on kubelet.sock, the plugin
directory wipe has already completed, so the wait fixes both of the
issues mentioned above.

Test Plan:
On AIO-SX:
Pre-requisite: The label kube-cpu-mgr-policy=static is assigned to
               the host with some CPUs reserved as
               application-isolated.
PASS: Restart kubelet and check windriver.com/isolcpus device
      capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Restart isolated CPU plugin and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Reboot controller and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Controller lock/unlock and check windriver.com/isolcpus
      device capacity and allocatables are updated correctly.
      (kubectl describe node)
PASS: Remove file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
      and check windriver.com/isolcpus device capacity and
      allocatables are updated correctly.
      (kubectl describe node)
PASS: Rename file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock
      and check windriver.com/isolcpus device capacity and
      allocatables are updated correctly.
      (kubectl describe node)

Note: Kubelet was patched to add a debug log after kubelet event 2
      above. The log always appeared before the
      'connection test success' info log in this change for all of
      the above test cases. This issue is hard to reproduce without
      patching kubelet and the plugin binary. So the fix cannot be
      verified against the failure but can be better confirmed
      through the log events.

Closes-Bug: 2064777

Change-Id: I9645af7609cab8703fe22e05125fbf2fcfb2d20c
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
2024-05-06 19:06:43 +00:00
intel/intel-device-plugins-for-kubernetes/pkg isolcpu_plugin: wait for kubelet.sock to be ready 2024-05-06 19:06:43 +00:00
kubernetes/pkg/kubelet add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
vendor Fix lint errors identified by Zuul pylint job 2023-03-15 12:07:17 +00:00
go.mod add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
go.sum add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
isolcpu_plugin.conf add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
isolcpu_plugin.service add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
isolcpu.go add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
LICENSE add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00
README.md add isolcpus device plugin for kubernetes 2021-04-01 11:10:09 -06:00

Isolated CPUs Device Plugin for Kubernetes

About

This code implements a Kubernetes device plugin. The plugin detects all CPUs specified via "isolcpus=X" in the kernel boot args, and exports them to Kubernetes as custom devices using the deviceplugin API.

It makes heavy use of the Intel device plugin manager from github.com/intel/intel-device-plugins-for-kubernetes and credit is due to them for making a useful helper. A good example of how to use that framework can be found at https://github.com/intel/intel-device-plugins-for-kubernetes/blob/master/cmd/gpu_plugin/gpu_plugin.go

Implementation Notes

There are currently problems with using go modules for the deviceplugin API: it leads to a "go: error loading module requirements" error when running "go build". Accordingly, it was necessary to copy a number of files from external packages. As part of this work I also updated the deviceplugin API files to the latest versions to pick up in-development upstream changes.

The "intel/intel-device-plugins-for-kubernetes" subdirectory corresponds to "github.com/intel/intel-device-plugins-for-kubernetes".

The "kubernetes" subdirectory corresponds to "k8s.io/kubernetes".

In an ideal world, these two subdirectories would not be needed, and instead we would simply include the following imports in isolcpu.go:

"github.com/intel/intel-device-plugins-for-kubernetes/pkg/debug"
dpapi "github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin"
pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"

This would also require updating the Intel package to pick up the latest deviceplugin API so that the topology field is properly represented.

Build Notes

In order to avoid the need for a network connection to download dependencies at build time, I've chosen to include all the dependencies in the "vendor" directory. This is auto-generated by running "go mod vendor". The binary is then built with "go build -mod=vendor".