Handle invalid request lock in request cleanup

We have seen in production the case where node request handler cleanup
fails due to being unable to cleanup a node request lock. This appears
to happen because the node request lock has been invalidated somehow.
The lock may or may not be valid in zookeeper, but the python object has
been updated to None.

This is a defenseive change to handle when that node request lock is
None. This occurs when the node request has already been removed from
zookeeper so it is safe to ignore this node request lock and move on
with processing other requests in the handler. The node request lock
clenaup step should clean out any leaks on the zookeeper side of things.

Change-Id: I30e78bc67906d9ad97bf9ca964ea145c67526a8b
This commit is contained in:
Clark Boylan 2017-12-06 16:57:07 -08:00
parent 92451cdb73
commit f36fc13222
2 changed files with 24 additions and 6 deletions

View File

@ -20,6 +20,7 @@ import abc
import six
from nodepool import zk
from nodepool import exceptions
@six.add_metaclass(abc.ABCMeta)
@ -179,7 +180,15 @@ class NodeRequestHandler(object):
node.allocated_to = None
self.zk.storeNode(node)
self.unlockNodeSet()
try:
self.zk.unlockNodeRequest(self.request)
except exceptions.ZKLockException:
# If the lock object is invalid that is "ok" since we no
# longer have a request either. Just do our best, log and
# move on.
self.log.debug("Request lock invalid for node request %s "
"when attempting to clean up the lock",
self.request.id)
return True
if self.launch_manager.failed_nodes:

View File

@ -217,10 +217,19 @@ class PoolWorker(threading.Thread):
'''
active_handlers = []
for r in self.request_handlers:
try:
if not r.poll():
active_handlers.append(r)
else:
self.log.debug("Removing handler for request %s", r.request.id)
self.log.debug("Removing handler for request %s",
r.request.id)
except Exception:
# If we fail to poll a request handler log it but move on
# and process the other handlers. We keep this handler around
# and will try again later.
self.log.exception("Error polling request handler for "
"request %s", r.request.id)
active_handlers.append(r)
self.request_handlers = active_handlers
active_reqs = [r.request.id for r in self.request_handlers]
self.log.debug("Active requests: %s", active_reqs)
@ -424,8 +433,8 @@ class CleanupWorker(BaseCleanupWorker):
Because the node request locks are not direct children of the request
znode, we need to remove the locks separately after the request has
been processed. Only remove them after LOCK_CLEANUP seconds have
passed. This helps prevent the scenario where a request could go
away _while_ a lock is currently held for processing and the cleanup
passed. This helps reduce chances of the scenario where a request could
go away _while_ a lock is currently held for processing and the cleanup
thread attempts to delete it. The delay should reduce the chance that
we delete a currently held lock.
'''