[Mauiusers] Re: [torqueusers] Job eligible, nodes free, but job would not start
Neelesh Arora
narora at Princeton.EDU
Wed Oct 18 13:49:19 MDT 2006
Garrick Staples wrote:
> On Fri, Oct 13, 2006 at 04:52:23PM -0400, Neelesh Arora alleged:
>> Garrick Staples wrote:
>>> On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
>>>> - There are several jobs in the queue that are in the Q state. When I do
>>>> checkjob <jobid>, I get (among other things):
>>>> "job can run in partition DEFAULT (63 procs available. 1 procs required)"
>>>> but the job remains in Q forever. It is not the case of a resource
>>>> requirement not being met (as the above message indicates)
>>> That means a reservation is set preventing the jobs from running.
>>>
>>>> - restarting torque and maui did not help either
>>> Look at the reservations preventing the job from running.
>>>
>> If I do showres, I get the expected reservations for the running jobs.
>> By expected, I mean the number/name of nodes assigned to each job are as
>> reported by qstat/checkjob. There is only one reservation for an idle job:
>> ReservationID  Type  S  Start     End       Duration  N/P   StartTime
>> 88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32
>> and,
>> # showres -n|grep 88655
>> node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>> node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>> node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>> node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>> node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>>
>> So, this probably means that no other job can start on these nodes. That
>> still leaves 60+ nodes that have no reservations on them. Is there
>> something else I am missing here?
>
> You might need to increase RESERVATIONDEPTH, I have mine at 500.
>
Indeed, increasing RESERVATIONDEPTH fixed the issue. All stuck jobs
started running and there are more reservations for Idle jobs now.
Thanks.
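For the archive, here is a minimal sketch of what the change looks like in
maui.cfg. The value 500 is simply the figure Garrick quoted for his site,
not a recommendation, and the location of maui.cfg depends on the
installation. As far as I know the documented default is 1, which would be
consistent with the single Idle-job reservation (88655) in the showres
output above.

```
# maui.cfg (sketch; 500 is the value quoted in this thread, tune per site)
# Number of priority (Idle-job) reservations Maui is allowed to create.
RESERVATIONDEPTH  500
```

I restarted maui after editing the file so that the new value was picked
up; "showres" should then list reservations for more than one Idle job.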
Is there a good rule of thumb for choosing the value of this
parameter? Or, as with most things, does one have to go through trial and error?
-Neel