[torqueusers] Jobs sitting in queue for no reason
btmiller at helix.nih.gov
Thu Sep 28 22:29:29 MDT 2006
I have recently upgraded to Torque 2.1.2 with the C scheduler and am
experiencing a very weird problem. Jobs will sit in the execution queue
and not run, even though pbsnodes shows sufficient free nodes matching
the job spec to run the job. I've done a fair bit of digging to try to
find the root cause, and the problem seems to be in the code in
node_manager.c, since I get a lot of messages like "cannot allocate
node n2.lobos.nih.gov to job ...".
After adding some custom debug code, it appears that the second for
loop in node_spec picks the first node in the server's list of nodes,
sees that it's not valid to run on (the node is busy), goes searching
for a new node via the search function, fails to find one for some
reason, and then bails out with an error message like the one above.
I should note that this problem occurs irregularly. That is, things
will work fine for a few hours, then the problem will crop up, and
then it will go away on its own after a little while.
Since I don't recall seeing anything else like this on the list, I
wonder if maybe my configuration is a problem -- I'm pasting my server
config below in case someone sees something dumb that I have done or
failed to do.
server_state = Active
scheduling = True
total_jobs = 51
state_count = Transit:0 Queued:13 Held:0 Waiting:0 Running:38
managers = <<user list deleted>>
default_queue = entry
log_events = 511
mail_from = adm
resources_assigned.nodect = 48
scheduler_iteration = 600
node_check_rate = 120
tcp_timeout = 6
pbs_version = 2.1
I hope someone has some ideas since I'm tearing my hair out and going
through the code in node_manager.c is somewhat tough sledding for
someone not familiar with how this is supposed to work.