[Mauiusers] Job cannot be started, 15062, Unknown node

Jan Ploski Jan.Ploski at offis.de
Tue Sep 25 10:22:40 MDT 2007


mauiusers-bounces at supercluster.org wrote on 09/25/2007 12:18:23 PM:

> Hello,
> 
> I have a job stuck at the front of the queue, apparently blocking all
> other, lower-priority jobs from executing. checkjob reports the useless
> error message "Unknown node" (see subject).
> 
> I tracked down the cause of the problem to an invalid nodelist
> specification which Maui produces for the job. More precisely, this
> is what I request:
> 
> nodes=18:ib:ppn=4+1:ib:ppn=2
> 
> and this is what Maui gives me:
> 
> node3:ppn=4+node4:ppn=8+node5:ppn=4+node6:ppn=4+node7:ppn=4+
> node8:ppn=4+node9:ppn=4+node10:ppn=4+node11:ppn=4+node12:ppn=4+
> node13:ppn=4+node14:ppn=4+node15:ppn=4+node16:ppn=4+node18:ppn=4+
> node20:ppn=4+node22:ppn=4+node17:ppn=2
> 
> If you sum up the ppn, you get the 74 processors requested, but on only
> 18 distinct nodes instead of 19. Moreover, it tries to give me
> node4:ppn=8, even though node4 is configured with only 4 processors.
> This is why TORQUE rejects the job.
> 
> Now, I can debug the maui process and see that the TC is 8 instead of 4
> in the job's NodeList, and I can also see that the nodes are allocated as
> expected (TC=4) in the job's reqs, but I don't know where the job's
> NodeList comes from. I don't even know whether it is overwritten with the
> wrong value on each scheduling cycle or whether it was computed once when
> the job was created. I'd be grateful for some debugging tips from Maui
> developers.
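
The per-node problem described in the quoted message can be checked
mechanically. The following is a minimal Python sketch, not part of Maui or
TORQUE: it parses a nodelist string of the form shown above and reports the
entries whose ppn exceeds the node's processor count. The function names are
hypothetical, and the default of 4 processors per node is an assumption (the
message only confirms it for node4).

```python
def parse_nodelist(spec):
    """Parse 'name:ppn=N+name:ppn=M+...' into a list of (name, ppn) pairs."""
    entries = []
    for part in spec.split("+"):
        name, _, ppn = part.partition(":ppn=")
        entries.append((name, int(ppn)))
    return entries


def oversubscribed(entries, capacity=None, default=4):
    """Return the entries that ask for more processors than the node has.

    capacity maps node name -> processor count; nodes missing from it are
    assumed (for illustration only) to have `default` processors.
    """
    capacity = capacity or {}
    return [(name, ppn) for name, ppn in entries
            if ppn > capacity.get(name, default)]


# The nodelist quoted in the message above, joined into one string.
NODELIST = (
    "node3:ppn=4+node4:ppn=8+node5:ppn=4+node6:ppn=4+node7:ppn=4+"
    "node8:ppn=4+node9:ppn=4+node10:ppn=4+node11:ppn=4+node12:ppn=4+"
    "node13:ppn=4+node14:ppn=4+node15:ppn=4+node16:ppn=4+node18:ppn=4+"
    "node20:ppn=4+node22:ppn=4+node17:ppn=2"
)

if __name__ == "__main__":
    for name, ppn in oversubscribed(parse_nodelist(NODELIST)):
        print(f"{name} requests ppn={ppn}, more than it is assumed to have")
```

With the assumed capacity of 4, only the node4 entry is reported.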

I got some additional information from my debugging sessions:

The invalid ppn for node4 is computed in the loop preceded by the comment
/* coallesce multi-req NodeList */ in MSched.c. The loop itself seems to be
innocent; the problem is that node4 appears TWICE in J->NodeList, each time
with TC=4. It also already appears twice in J->Req[0]->NodeList, which is
filled in the function MJobAllocMNL. This may be because it also appears
twice in the MFeasibleList[0] passed to this function. While I am pretty sure
that the node should NOT be included twice in J->Req[0]->NodeList, I am not
so sure about MFeasibleList. Any suggestions on what I should check next
to track this down?
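
The symptom can be reproduced in isolation. The following Python sketch
illustrates the idea only and is not the code from MSched.c: when a list of
(node, TC) pairs contains the same node twice with TC=4, merging entries by
node name yields a single entry with TC=8, which matches the node4:ppn=8
seen above. The helper names and the sample data are hypothetical.

```python
from collections import Counter


def coalesce(alloc):
    """Merge (node, task_count) pairs by node name, summing the task counts."""
    merged = {}
    for node, tc in alloc:
        merged[node] = merged.get(node, 0) + tc
    return list(merged.items())


def duplicates(alloc):
    """Return the node names that occur more than once in the list."""
    counts = Counter(node for node, _ in alloc)
    return [node for node, n in counts.items() if n > 1]


# Hypothetical excerpt of a per-req node list in which node4 occurs twice.
req_nodelist = [("node3", 4), ("node4", 4), ("node4", 4), ("node5", 4)]

if __name__ == "__main__":
    print("duplicates:", duplicates(req_nodelist))
    print("coalesced: ", coalesce(req_nodelist))
```

A check like duplicates() applied to the feasible list and to the per-req
list would show at which stage the second node4 entry first appears.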

Regards,
Jan Ploski

