4f9a8b85c2
This change fixes two issues with the Isolated CPUs plugin:
1. The isolated CPU plugin systemd service does not start on the first
   attempt following kubelet start.
2. Kubelet has intermittent communication failures with isolcpus_plugin
   and therefore reports 0 allocatable isolated CPU devices.

The plugin communicates with the kubelet through the RPC server at
/var/lib/kubelet/device-plugins/kubelet.sock, whereas the kubelet
communicates with the plugin through the socket file
/var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock.
As per the Kubernetes guidelines, the plugin watches for removal or
renaming of windriver.com-isolcpus.sock and restarts itself in such an
event.

The following events take place in the kubelet and the plugin after they
are started:

Plugin:
1. Create socket file windriver.com-isolcpus.sock.
2. Start serving on the socket file.
3. Register itself with the kubelet.
4. Start a watch on the socket file.

Kubelet (events related to the device plugin manager only):
1. Start the device plugin registration server and wipe out
   /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock and
   /var/lib/kubelet/device-plugins/kubelet.sock.
2. Create kubelet.sock and start serving on it.
3. Register a plugin upon a registration request.
4. Request device information from the plugin.

In a production environment, kubelet startup time varies, so the events
above can interleave in any order.

If plugin event 3 happens before kubelet event 2, the plugin tries to
register while kubelet.sock is not yet being served, and it fails. This
is the root cause of the 1st issue.

If the order is plugin events 1 and 2 -> kubelet events 1 and 2 ->
plugin event 3 -> kubelet events 3 and 4, the kubelet wipes
windriver.com-isolcpus.sock after the plugin has created it, so the
kubelet cannot find the file when it requests device information. This
causes the 2nd issue.

This change makes isolcpu_plugin wait for kubelet.sock to be ready
before it proceeds. This ensures that the kubelet has completed the
plugin directory wipe and is serving kubelet.sock, which fixes both
issues.

Test Plan:
On AIO-SX:
Pre-requisite: The label kube-cpu-mgr-policy=static is assigned to the
host, with some CPUs reserved as application-isolated.

In each test below, the result was checked with 'kubectl describe node'.

PASS: Restart kubelet and check that windriver.com/isolcpus device
      capacity and allocatable values are updated correctly.
PASS: Restart the isolated CPU plugin and check that
      windriver.com/isolcpus device capacity and allocatable values are
      updated correctly.
PASS: Reboot the controller and check that windriver.com/isolcpus
      device capacity and allocatable values are updated correctly.
PASS: Lock/unlock the controller and check that windriver.com/isolcpus
      device capacity and allocatable values are updated correctly.
PASS: Remove file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock and
      check that windriver.com/isolcpus device capacity and allocatable
      values are updated correctly.
PASS: Rename file
      /var/lib/kubelet/device-plugins/windriver.com-isolcpus.sock and
      check that windriver.com/isolcpus device capacity and allocatable
      values are updated correctly.

Note: Kubelet was patched to add a debug log after kubelet event 2
above. In all of the test cases above, that log always appeared before
the 'connection test success' info log added in this change.
The issue is hard to reproduce without patching both the kubelet and
the plugin binary, so the fix could not be verified against the
original failure. The ordering of the log events is the best available
confirmation.

Closes-Bug: 2064777

Change-Id: I9645af7609cab8703fe22e05125fbf2fcfb2d20c
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>