Handle invalid request lock in request cleanup

We have seen in production the case where node request handler cleanup fails due to being unable to cleanup a node request lock. This appears to happen because the node request lock has been invalidated somehow. The lock may or may not be valid in zookeeper, but the python object has been updated to None. This is a defenseive change to handle when that node request lock is None. This occurs when the node request has already been removed from zookeeper so it is safe to ignore this node request lock and move on with processing other requests in the handler. The node request lock clenaup step should clean out any leaks on the zookeeper side of things. Change-Id: I30e78bc67906d9ad97bf9ca964ea145c67526a8b
2017-12-06 16:57:07 -08:00 · 2017-12-06 16:57:07 -08:00 · f36fc13222
commit f36fc13222
parent 92451cdb73
2 changed files with 24 additions and 6 deletions
--- a/nodepool/driver/init.py
+++ b/nodepool/driver/init.py
@ -20,6 +20,7 @@ import abc
 import six

 from nodepool import zk
+from nodepool import exceptions


@six.add_metaclass(abc.ABCMeta)
@ -179,7 +180,15 @@ class NodeRequestHandler(object):
                node.allocated_to = None
                self.zk.storeNode(node)
            self.unlockNodeSet()
-            self.zk.unlockNodeRequest(self.request)
+            try:
+                self.zk.unlockNodeRequest(self.request)
+            except exceptions.ZKLockException:
+                # If the lock object is invalid that is "ok" since we no
+                # longer have a request either. Just do our best, log and
+                # move on.
+                self.log.debug("Request lock invalid for node request %s "
+                               "when attempting to clean up the lock",
+                               self.request.id)
            return True

        if self.launch_manager.failed_nodes:
--- a/nodepool/launcher.py
+++ b/nodepool/launcher.py
@ -217,10 +217,19 @@ class PoolWorker(threading.Thread):
        '''
        active_handlers = []
        for r in self.request_handlers:
-            if not r.poll():
+            try:
+                if not r.poll():
+                    active_handlers.append(r)
+                else:
+                    self.log.debug("Removing handler for request %s",
+                                   r.request.id)
+            except Exception:
+                # If we fail to poll a request handler log it but move on
+                # and process the other handlers. We keep this handler around
+                # and will try again later.
+                self.log.exception("Error polling request handler for "
+                                   "request %s", r.request.id)
                active_handlers.append(r)
-            else:
-                self.log.debug("Removing handler for request %s", r.request.id)
        self.request_handlers = active_handlers
        active_reqs = [r.request.id for r in self.request_handlers]
        self.log.debug("Active requests: %s", active_reqs)
@ -424,8 +433,8 @@ class CleanupWorker(BaseCleanupWorker):
        Because the node request locks are not direct children of the request
        znode, we need to remove the locks separately after the request has
        been processed. Only remove them after LOCK_CLEANUP seconds have
-        passed. This helps prevent the scenario where a request could go
-        away _while_ a lock is currently held for processing and the cleanup
+        passed. This helps reduce chances of the scenario where a request could
+        go away _while_ a lock is currently held for processing and the cleanup
        thread attempts to delete it. The delay should reduce the chance that
        we delete a currently held lock.
        '''