4f9a8b85c2
This change fixes two issues with the Isolated CPUs plugin:

1. The isolated CPU plugin systemd service does not start on the first attempt following kubelet start.
2. Kubelet has intermittent communication failures with isolcpus_plugin and hence reports 0 allocatable isolated CPU devices.

The plugin communicates with the kubelet using the RPC server at /var/lib/kubelet/device-plugins/kubelet.sock, whereas the kubelet communicates with the plugin using the socket file /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock. As per the Kubernetes guidelines, the plugin watches for removal or renaming of the file windriver.com-isolcpus.sock and restarts itself in such an event.

The following events take place in the kubelet and the plugin after they are started:

Plugin:
1. Create the socket file windriver.com-isolcpus.sock.
2. Start serving on the socket file.
3. Register itself with the kubelet.
4. Start a watch on the socket file.

Kubelet (events related to the device plugin manager only):
1. Start the device plugin registration server and wipe out /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock and /var/lib/kubelet/device-plugins/kubelet.sock.
2. Create kubelet.sock and start serving on it.
3. Register a plugin upon registration request.
4. Request device information from the plugin.

In a production environment, kubelet startup time varies, and the above events can interleave in any order. Plugin event 3 happening before kubelet event 2 causes the plugin to fail and is the root cause of the first issue. The sequence plugin events 1 and 2 -> kubelet events 1 and 2 -> plugin event 3 -> kubelet events 3 and 4 causes the kubelet to not find the file windriver.com-isolcpus.sock, because kubelet event 1 wiped it after the plugin created it, and is the root cause of the second issue.

This change adds a wait to the isolcpu_plugin for kubelet.sock to be ready. This ensures that the plugin directory wipe has completed and that the kubelet is serving kubelet.sock, hence fixing both issues.
Test Plan:

On AIO-SX:

Pre-requisite: The label kube-cpu-mgr-policy=static is assigned to the host with some CPUs reserved as application-isolated.

For each test case, the check is that the windriver.com/isolcpus device capacity and allocatables are updated correctly (kubectl describe node).

PASS: Restart kubelet.
PASS: Restart the isolated CPU plugin.
PASS: Reboot the controller.
PASS: Lock/unlock the controller.
PASS: Remove the file /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock.
PASS: Rename the file /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock.

Note: Kubelet was patched to add a debug log after kubelet event 2 above. In all of the above test cases, that log appeared before the 'connection test success' info log added by this change. The issue is hard to reproduce without patching the kubelet and the plugin binary, so the fix cannot be verified against the failure itself, but it can be confirmed through the order of the log events.

Closes-Bug: 2064777
Change-Id: I9645af7609cab8703fe22e05125fbf2fcfb2d20c
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
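The wait described in the commit message can be illustrated with a minimal Go sketch. This is not the code from the change itself: the function name waitForSocket, the polling interval, and the timeout are assumptions chosen for illustration. Only the socket path and the 'connection test success' log text come from the commit message. The idea is that a successful dial on kubelet.sock implies the kubelet has already wiped the plugin directory and started serving.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// kubeletSocket is the kubelet registration socket named in the commit message.
const kubeletSocket = "/var/lib/kubelet/device-plugins/kubelet.sock"

// waitForSocket (hypothetical helper) polls until a Unix-domain connection to
// path succeeds or timeout elapses. A successful dial means a server is
// listening on the socket, not merely that the file exists.
func waitForSocket(path string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("unix", path, interval)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("socket %s not ready after %v: %v", path, timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Interval and timeout are illustrative values, not taken from the change.
	if err := waitForSocket(kubeletSocket, time.Second, 30*time.Second); err != nil {
		fmt.Println("kubelet not ready:", err)
		return
	}
	fmt.Println("connection test success")
}
```

Dialing the socket rather than checking for the file with a stat call avoids a false positive on a stale kubelet.sock left over from before the kubelet restart.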
intel/intel-device-plugins-for-kubernetes/pkg
kubernetes/pkg/kubelet
vendor
go.mod
go.sum
isolcpu_plugin.conf
isolcpu_plugin.service
isolcpu.go
LICENSE
README.md
Isolated CPUs Device Plugin for Kubernetes
About
This code implements a Kubernetes device plugin. The plugin detects all CPUs specified via "isolcpus=X" in the kernel boot args, and exports them to Kubernetes as custom devices using the deviceplugin API.
It makes heavy use of the Intel device plugin manager from github.com/intel/intel-device-plugins-for-kubernetes and credit is due to them for making a useful helper. A good example of how to use that framework can be found at https://github.com/intel/intel-device-plugins-for-kubernetes/blob/master/cmd/gpu_plugin/gpu_plugin.go
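As an illustration of the detection step described above, the following Go sketch extracts CPU IDs from an "isolcpus=X" boot argument. It is a sketch under assumptions, not the plugin's actual code: the function name parseIsolcpus is hypothetical, and the README does not say whether the plugin reads the kernel command line directly or uses another source. Stride syntax in CPU lists (for example "0-10:2/4") is not handled here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIsolcpus (hypothetical helper) returns the CPU IDs listed in the
// "isolcpus=" argument of a kernel command line. Non-numeric entries such as
// the "nohz" or "domain" flags are skipped.
func parseIsolcpus(cmdline string) ([]int, error) {
	var cpus []int
	for _, arg := range strings.Fields(cmdline) {
		if !strings.HasPrefix(arg, "isolcpus=") {
			continue
		}
		for _, item := range strings.Split(strings.TrimPrefix(arg, "isolcpus="), ",") {
			bounds := strings.SplitN(item, "-", 2)
			first, err := strconv.Atoi(bounds[0])
			if err != nil {
				continue // a flag or empty entry, not a CPU ID
			}
			last := first
			if len(bounds) == 2 {
				if last, err = strconv.Atoi(bounds[1]); err != nil {
					return nil, fmt.Errorf("bad CPU range %q: %v", item, err)
				}
			}
			for cpu := first; cpu <= last; cpu++ {
				cpus = append(cpus, cpu)
			}
		}
	}
	return cpus, nil
}

func main() {
	// Sample command line; on a real host this would come from /proc/cmdline.
	cpus, err := parseIsolcpus("BOOT_IMAGE=/vmlinuz isolcpus=2-4,7 quiet")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cpus) // prints [2 3 4 7]
}
```

Each CPU ID found this way would then be advertised to Kubernetes as one device of the custom resource.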
Implementation Notes
There are currently problems with using Go modules for the deviceplugin API: running "go build" fails with a "go: error loading module requirements" error. Accordingly, it was necessary to copy a number of files from external packages. As part of this work I also updated the deviceplugin API files to the latest versions to pick up in-development upstream changes.
The "intel/intel-device-plugins-for-kubernetes" subdirectory corresponds to "github.com/intel/intel-device-plugins-for-kubernetes".
The "kubernetes" subdirectory corresponds to "k8s.io/kubernetes".
In an ideal world, these two subdirectories would not be needed, and instead we would simply include the following imports in isolcpu.go:
"github.com/intel/intel-device-plugins-for-kubernetes/pkg/debug"
dpapi "github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin"
pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
"k8s.io/kubernetes/pkg/kubelet/cm/cpuset"
This would also require updating the Intel package to pick up the latest deviceplugin API so that the topology field is properly represented.
Build Notes
In order to avoid the need for a network connection to download dependencies at build time, I've chosen to include all the dependencies in the "vendor" directory. This is auto-generated by running "go mod vendor". The binary is then built with "go build -mod=vendor".