Handle invalid request lock in request cleanup
We have seen in production the case where node request handler cleanup fails due to being unable to cleanup a node request lock. This appears to happen because the node request lock has been invalidated somehow. The lock may or may not be valid in zookeeper, but the python object has been updated to None. This is a defenseive change to handle when that node request lock is None. This occurs when the node request has already been removed from zookeeper so it is safe to ignore this node request lock and move on with processing other requests in the handler. The node request lock clenaup step should clean out any leaks on the zookeeper side of things. Change-Id: I30e78bc67906d9ad97bf9ca964ea145c67526a8b
This commit is contained in:
parent
92451cdb73
commit
f36fc13222
@ -20,6 +20,7 @@ import abc
|
||||
import six
|
||||
|
||||
from nodepool import zk
|
||||
from nodepool import exceptions
|
||||
|
||||
|
||||
@six.add_metaclass(abc.ABCMeta)
|
||||
@ -179,7 +180,15 @@ class NodeRequestHandler(object):
|
||||
node.allocated_to = None
|
||||
self.zk.storeNode(node)
|
||||
self.unlockNodeSet()
|
||||
self.zk.unlockNodeRequest(self.request)
|
||||
try:
|
||||
self.zk.unlockNodeRequest(self.request)
|
||||
except exceptions.ZKLockException:
|
||||
# If the lock object is invalid that is "ok" since we no
|
||||
# longer have a request either. Just do our best, log and
|
||||
# move on.
|
||||
self.log.debug("Request lock invalid for node request %s "
|
||||
"when attempting to clean up the lock",
|
||||
self.request.id)
|
||||
return True
|
||||
|
||||
if self.launch_manager.failed_nodes:
|
||||
|
@ -217,10 +217,19 @@ class PoolWorker(threading.Thread):
|
||||
'''
|
||||
active_handlers = []
|
||||
for r in self.request_handlers:
|
||||
if not r.poll():
|
||||
try:
|
||||
if not r.poll():
|
||||
active_handlers.append(r)
|
||||
else:
|
||||
self.log.debug("Removing handler for request %s",
|
||||
r.request.id)
|
||||
except Exception:
|
||||
# If we fail to poll a request handler log it but move on
|
||||
# and process the other handlers. We keep this handler around
|
||||
# and will try again later.
|
||||
self.log.exception("Error polling request handler for "
|
||||
"request %s", r.request.id)
|
||||
active_handlers.append(r)
|
||||
else:
|
||||
self.log.debug("Removing handler for request %s", r.request.id)
|
||||
self.request_handlers = active_handlers
|
||||
active_reqs = [r.request.id for r in self.request_handlers]
|
||||
self.log.debug("Active requests: %s", active_reqs)
|
||||
@ -424,8 +433,8 @@ class CleanupWorker(BaseCleanupWorker):
|
||||
Because the node request locks are not direct children of the request
|
||||
znode, we need to remove the locks separately after the request has
|
||||
been processed. Only remove them after LOCK_CLEANUP seconds have
|
||||
passed. This helps prevent the scenario where a request could go
|
||||
away _while_ a lock is currently held for processing and the cleanup
|
||||
passed. This helps reduce chances of the scenario where a request could
|
||||
go away _while_ a lock is currently held for processing and the cleanup
|
||||
thread attempts to delete it. The delay should reduce the chance that
|
||||
we delete a currently held lock.
|
||||
'''
|
||||
|
Loading…
x
Reference in New Issue
Block a user