[torqueusers] Jobs sitting in queue for no reason

Tim Miller btmiller at helix.nih.gov
Thu Sep 28 22:29:29 MDT 2006

Hi All,

I have recently upgraded to Torque 2.1.2 with the C scheduler and am 
experiencing a very weird problem.  Jobs will sit in the execution queue 
and not run, even though pbsnodes shows sufficient free nodes matching 
the job spec. I've done a fair bit of digging to try to find the root 
cause, and the problem seems to be in the code in node_manager.c, 
because I get a lot of messages like "cannot allocate node 
n2.lobos.nih.gov to job ...".
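For completeness, these are the sorts of checks I've been running when a 
job gets stuck (the job ID below is just an example, not a real one from 
my queue):

```shell
# Full attributes of a stuck job -- compare its node request against
# what the server thinks is free.
qstat -f 12345.m1.lobos.nih.gov

# State of the node named in the "cannot allocate" message.
pbsnodes n2.lobos.nih.gov

# Pull together the server/scheduler log lines for the stuck job.
tracejob -n 2 12345.m1.lobos.nih.gov
```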

Adding in some custom debug code, it seems like the second for loop in 
node_spec picks the first node in the server's list of nodes, sees that 
it's not valid to run on (the node is busy), goes searching for a 
replacement via the search function, fails to find one for some reason, 
and then bails out with an error message like the one above.
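To make sure I'm reading the allocation path correctly, here is a 
stripped-down sketch of what I believe that loop is doing. The struct, 
the search()/allocate() names, and the single busy flag are my own 
simplifications for illustration, not the actual Torque source:

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for Torque's per-node structure. */
struct node {
    const char *name;
    int busy;   /* nonzero if the node is already allocated or down */
};

/* Simplified stand-in for the search helper: scan the node list
 * for one that is still free. */
static struct node *search(struct node *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!nodes[i].busy)
            return &nodes[i];
    return NULL;
}

/* Sketch of the second loop in node_spec: take the next node off the
 * server's list; if it is not valid to run on, look for a replacement,
 * and fail the whole allocation if none is found. */
static int allocate(struct node *nodes, size_t n, size_t need)
{
    size_t got = 0;
    for (size_t i = 0; i < n && got < need; i++) {
        struct node *cand = &nodes[i];
        if (cand->busy) {
            cand = search(nodes, n);
            if (cand == NULL) {
                /* This is where the "cannot allocate node ..." path
                 * seems to be taken. */
                fprintf(stderr, "cannot allocate node %s to job\n",
                        nodes[i].name);
                return -1;
            }
        }
        cand->busy = 1;   /* mark the node taken for this job */
        got++;
    }
    return got == need ? 0 : -1;
}
```

If that reading is right, then the failure means search() came up empty 
even though pbsnodes says free nodes exist, which is what I can't explain.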

I should note that this problem occurs irregularly. That is, things will 
work fine for a few hours, and then this problem will crop up and then 
go away on its own after a little while.

Since I don't recall seeing anything else like this on the list, I 
wonder if maybe my configuration is a problem -- I'm pasting my server 
config below in case someone sees something dumb that I have done or 
failed to do.

Server m1.lobos.nih.gov
        server_state = Active
        scheduling = True
        total_jobs = 51
        state_count = Transit:0 Queued:13 Held:0 Waiting:0 Running:38 
        managers = <<user list deleted>>
        default_queue = entry
        log_events = 511
        mail_from = adm
        resources_assigned.nodect = 48
        scheduler_iteration = 600
        node_check_rate = 120
        tcp_timeout = 6
        pbs_version = 2.1
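(The dump above came from qmgr; if anyone wants to compare against their 
own setup, the equivalent commands are below -- "entry" is just my 
default queue name from the config:)

```shell
# Dump the full server and queue configuration for comparison.
qmgr -c 'print server'
qmgr -c 'print queue entry'
```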

I hope someone has some ideas, since I'm tearing my hair out, and 
going through the code in node_manager.c is tough sledding for someone 
not familiar with how it is supposed to work.
