[torqueusers] Jobs sitting in queue for no reason

Garrick Staples garrick at clusterresources.com
Fri Sep 29 15:06:59 MDT 2006


On Fri, Sep 29, 2006 at 12:29:29AM -0400, Tim Miller alleged:
> Hi All,
> 
> I have recently upgraded to Torque 2.1.2 with the C scheduler and am 
> experiencing a very weird problem.  Jobs will sit in the execution queue 
> and not run, even though pbsnodes shows sufficient free nodes matching 
> the job spec to run the job. I've done a fair bit of digging to try to 
> find the root cause, and the problem seems to be in the code in 
> node_manager.c for I get a lot of messages like "cannot allocate node 
> n2.lobos.nih.gov to job ...".
> 
> Adding in some custom debug code, it seems like the second for loop in 
> node_spec picks the first node in the server's list of nodes, sees it's 
> not valid to run on (the node is busy), goes searching for a new node 
> via the search function, fails to find one for some reason, and then 
> dies out with an error message like the one above.
> 
> I should note that this problem occurs irregularly. That is, things will 
> work fine for a few hours, and then this problem will crop up and then 
> go away on its own after a little while.
> 
> Since I don't recall seeing anything else like this on the list, I 
> wonder if maybe my configuration is a problem -- I'm pasting my server 
> config below in case someone sees something dumb that I have done or 
> failed to do.
> 
> Server m1.lobos.nih.gov
>        server_state = Active
>        scheduling = True
>        total_jobs = 51
>        state_count = Transit:0 Queued:13 Held:0 Waiting:0 Running:38 
> Exiting:0
>        managers = <<user list deleted>>
>        default_queue = entry
>        log_events = 511
>        mail_from = adm
>        resources_assigned.nodect = 48
>        scheduler_iteration = 600
>        node_check_rate = 120
>        tcp_timeout = 6
>        pbs_version = 2.1
> 
> I hope someone has some ideas since I'm tearing my hair out and going 
> through the code in node_manager.c is somewhat tough sledding for 
> someone not familiar with how this is supposed to work.

"pbs_version = 2.1"?  Where did you get this build from?  At the very
least, that should always be a 3 digit number, plus optional extra stuff.
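
A rough sketch of that sanity check, for illustration only: it pulls the
pbs_version value out of a server attribute listing like the one quoted
above and tests whether it has three dot-separated numbers followed by
optional extra text. The function names and the regular expression are
assumptions made up for this example; they are not taken from the TORQUE
source, and the exact set of suffixes real builds use is not verified here.

```python
import re

# Assumed shape of a release version: three dot-separated numbers,
# optionally followed by extra non-space text (e.g. a patch or snapshot tag).
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+\S*$")


def extract_pbs_version(server_listing):
    """Return the pbs_version value from an attribute listing, or None."""
    for line in server_listing.splitlines():
        # Drop surrounding whitespace and any leading mail-quote markers.
        cleaned = line.strip().lstrip("> ")
        name, sep, value = cleaned.partition("=")
        if sep and name.strip() == "pbs_version":
            return value.strip()
    return None


def looks_like_release_version(version):
    """True if the string has at least the three numeric components."""
    return VERSION_RE.match(version) is not None


if __name__ == "__main__":
    listing = """
           node_check_rate = 120
           tcp_timeout = 6
           pbs_version = 2.1
    """
    version = extract_pbs_version(listing)
    print(version, looks_like_release_version(version))
```

With the listing from the original message, the extracted value is "2.1",
which fails the three-component check, whereas a string such as "2.1.2"
would pass.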


