[torqueusers] Jobs sitting in queue for no reason
garrick at clusterresources.com
Fri Sep 29 16:53:15 MDT 2006
On Fri, Sep 29, 2006 at 12:29:29AM -0400, Tim Miller alleged:
> Hi All,
> I have recently upgraded to Torque 2.1.2 with the C scheduler and am
> experiencing a very weird problem. Jobs will sit in the execution queue
> and not run, even though pbsnodes shows sufficient free nodes matching
> the job spec to run the job. I've done a fair bit of digging to try to
> find the root cause, and the problem seems to be in the code in
> node_manager.c, since I get a lot of messages like "cannot allocate node
> n2.lobos.nih.gov to job ...".
> Adding in some custom debug code, it seems like the second for loop in
> node_spec picks the first node in the server's list of nodes, sees it's
> not valid to run on (the node is busy), goes searching for a new node
> via the search function, fails to find one for some reason, and then
> dies out with an error message like the one above.
> I should note that this problem occurs irregularly. That is, things will
> work fine for a few hours, and then this problem will crop up and then
> go away on its own after a little while.
> Since I don't recall seeing anything else like this on the list, I
> wonder if maybe my configuration is a problem -- I'm pasting my server
> config below in case someone sees something dumb that I have done or
> failed to do.
> Server m1.lobos.nih.gov
> server_state = Active
> scheduling = True
> total_jobs = 51
> state_count = Transit:0 Queued:13 Held:0 Waiting:0 Running:38
> managers = <<user list deleted>>
> default_queue = entry
> log_events = 511
> mail_from = adm
> resources_assigned.nodect = 48
> scheduler_iteration = 600
> node_check_rate = 120
> tcp_timeout = 6
> pbs_version = 2.1
> I hope someone has some ideas since I'm tearing my hair out and going
> through the code in node_manager.c is somewhat tough sledding for
> someone not familiar with how this is supposed to work.
No doubt, this stuff is really complicated and very difficult to debug.
I'd run pbs_server in debug mode (set PBSDEBUG before starting it) and
let it run for a while in your terminal. You'll see a list of counts for
each subnode as jobs are scheduled. When you notice the problem
happening, look at those numbers and see whether they make sense.
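A minimal sketch of that workflow (the shutdown step and paths are illustrative; adjust for your installation, and note PBSDEBUG just needs to be set in the environment before pbs_server starts):

```shell
# Cleanly stop the currently running server daemon first.
qterm -t quick

# Set PBSDEBUG so pbs_server stays in the foreground and writes
# its debug output to this terminal instead of daemonizing.
export PBSDEBUG=1
pbs_server

# In another terminal, submit jobs as usual and watch the per-subnode
# counts printed at each scheduling cycle; capture them to a file if
# the problem only shows up after a few hours, e.g.:
#   pbs_server 2>&1 | tee /tmp/pbs_server.debug.log
```

Leaving it piped through tee means you still have the counts from the window when the problem occurred, even if it has gone away again by the time you look.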