[Mauiusers] Re: [torqueusers] Job eligible, nodes free, but job would not start

Neelesh Arora narora at Princeton.EDU
Wed Oct 18 13:49:19 MDT 2006



Garrick Staples wrote:
> On Fri, Oct 13, 2006 at 04:52:23PM -0400, Neelesh Arora alleged:
>> Garrick Staples wrote:
>>> On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
>>>> - There are several jobs in the queue that are in the Q state. When I do 
>>>> checkjob <jobid>, I get (among other things):
>>>> "job can run in partition DEFAULT (63 procs available.  1 procs required)"
>>>> but the job remains in Q forever. It is not the case of a resource 
>>>> requirement not being met (as the above message indicates)
>>> That means a reservation is set preventing the jobs from running.
>>>
>>>> - restarting torque and maui did not help either
>>> Look at the reservations preventing the job from running.
>>>
>> If I do showres, I get the expected reservations for the running jobs. 
>> By expected, I mean the number/name of nodes assigned to each job are as 
>> reported by qstat/checkjob. There is only one reservation for an idle job:
>> ReservationID       Type S       Start         End    Duration    N/P    StartTime
>> 88655                Job I    INFINITY    INFINITY    INFINITY    5/10   Mon Nov 12 15:52:32
>> and,
>> # showres -n|grep 88655
>> node015        Job              88655       Idle    2    INFINITY    INFINITE  Mon Nov 12 15:52:32
>> node014        Job              88655       Idle    2    INFINITY    INFINITE  Mon Nov 12 15:52:32
>> node010        Job              88655       Idle    2    INFINITY    INFINITE  Mon Nov 12 15:52:32
>> node003        Job              88655       Idle    2    INFINITY    INFINITE  Mon Nov 12 15:52:32
>> node002        Job              88655       Idle    2    INFINITY    INFINITE  Mon Nov 12 15:52:32
>>
>> So, this probably means that no other job can start on these nodes. That 
>> still leaves 60+ nodes that have no reservations on them. Is there 
>> something else I am missing here?
> 
> You might need to increase RESERVATIONDEPTH, I have mine at 500.
> 

Indeed, increasing RESERVATIONDEPTH fixed the issue. All stuck jobs 
started running and there are more reservations for Idle jobs now.
Thanks.
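
For anyone finding this thread later, here is a minimal sketch of the kind 
of change involved, in maui.cfg. The value 500 is simply the one Garrick 
quoted for his own site, and the [0] bucket index is an assumption (it is 
the form I understand the Maui documentation to use); neither is 
necessarily right for another cluster.

```
# maui.cfg (sketch; value and bucket index are assumptions, see above)
# Allow Maui to hold priority reservations for up to 500 idle jobs,
# rather than only a single one as in the showres output quoted above.
RESERVATIONDEPTH[0]  500
```

The edit takes effect once maui is restarted and has reread maui.cfg.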

Is there a good rule of thumb for choosing the value of this 
parameter? Or, as with most things, does one have to go through trial and error?

-Neel

