[Mauiusers] Job cannot be started, 15062, Unknown node
Jan Ploski
Jan.Ploski at offis.de
Tue Sep 25 10:22:40 MDT 2007
mauiusers-bounces at supercluster.org wrote on 09/25/2007 12:18:23 PM:
> Hello,
>
> I have a job stuck in front of the queue, apparently blocking all other,
> lower-priority jobs from executing. checkjob reports the useless error
> message "Unknown node" (see subject).
>
> I tracked the cause of the problem down to an invalid nodelist
> specification which Maui produces for the job. More precisely, this
> is what I request:
>
> nodes=18:ib:ppn=4+1:ib:ppn=2
>
> and this is what Maui gives me:
>
> node3:ppn=4+node4:ppn=8+node5:ppn=4+node6:ppn=4+node7:ppn=4+
> node8:ppn=4+node9:ppn=4+node10:ppn=4+node11:ppn=4+node12:ppn=4+
> node13:ppn=4+node14:ppn=4+node15:ppn=4+node16:ppn=4+node18:ppn=4+
> node20:ppn=4+node22:ppn=4+node17:ppn=2
>
> If you sum up the ppn, you will notice that the total (74) matches what I
> requested, but it is spread over 18 hosts instead of 19. Moreover, it tries to
> give me node4:ppn=8 - even though node4 is configured with only 4
> processors. This is why TORQUE rejects the job.
>
> Now, I can debug the maui process and see that the TC is 8 instead of 4 in
> the job's NodeList, and I can also see that the nodes are allocated as
> expected (TC=4) in the job's reqs, but I don't know where the NodeList of
> the job comes from. I don't even know whether it is overwritten with the
> wrong value on each scheduling cycle or whether it was computed once when
> the job was created. I'd be grateful for some debugging tips from Maui
> developers.
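For anyone who wants to double-check a host list like the one quoted above, here is a rough Python sketch that splits it into (host, ppn) pairs and flags every entry asking for more processors than the node has. The function names are made up for illustration, and the limit of 4 processors per node is an assumption: only node4's processor count is stated above.

```python
def parse_nodelist(spec):
    """Split 'host:ppn=N+host:ppn=M+...' into a list of (host, ppn) pairs."""
    entries = []
    for part in spec.split("+"):
        host, _, ppn = part.partition(":ppn=")
        entries.append((host, int(ppn)))
    return entries


def check_nodelist(spec, procs_per_node):
    """Return (total ppn, number of entries, entries exceeding procs_per_node)."""
    entries = parse_nodelist(spec)
    total = sum(ppn for _, ppn in entries)
    over = [(host, ppn) for host, ppn in entries if ppn > procs_per_node]
    return total, len(entries), over


# The host list produced by Maui, copied from the message above.
NODELIST = (
    "node3:ppn=4+node4:ppn=8+node5:ppn=4+node6:ppn=4+node7:ppn=4+"
    "node8:ppn=4+node9:ppn=4+node10:ppn=4+node11:ppn=4+node12:ppn=4+"
    "node13:ppn=4+node14:ppn=4+node15:ppn=4+node16:ppn=4+node18:ppn=4+"
    "node20:ppn=4+node22:ppn=4+node17:ppn=2"
)

if __name__ == "__main__":
    # procs_per_node=4 is an assumption applied to every node.
    total, hosts, over = check_nodelist(NODELIST, procs_per_node=4)
    print(f"{hosts} hosts, {total} tasks, over-subscribed: {over}")
```

On the list above, the only entry it flags is node4 (ppn=8 against 4 configured processors).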
I got some additional information from my debugging sessions:
The invalid ppn for node4 is computed in the loop preceded by the comment
/* coallesce multi-req NodeList */ in MSched.c. The loop itself seems to be
innocent; the problem is that node4 appears TWICE in J->NodeList, each time
with TC=4. It also already appears twice in J->Req[0]->NodeList, which is
filled in by the function MJobAllocMNL. This may be because it also appears
twice in MFeasibleList[0], which is passed to this function. While I am
pretty sure that the node should NOT be included twice in
J->Req[0]->NodeList, I am not so sure about MFeasibleList. Any suggestions
on what I should check next to track this down?
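To illustrate the effect as I understand it, here is a small Python model. It is not the Maui code: the function names and the list-of-pairs layout are invented for the example, and I am assuming the coalescing step simply adds up the task counts of entries that name the same node. Under that assumption a duplicated node with TC=4 comes out as a single entry with TC=8.

```python
def coalesce(node_list):
    """Merge entries that name the same node by summing their task counts.

    node_list is a list of (node, tc) pairs; first-seen order is preserved.
    """
    merged = {}
    for node, tc in node_list:
        merged[node] = merged.get(node, 0) + tc
    return list(merged.items())


def find_duplicates(node_list):
    """Return the nodes that occur more than once, in order of first repeat."""
    seen = set()
    dups = []
    for node, _ in node_list:
        if node in seen and node not in dups:
            dups.append(node)
        seen.add(node)
    return dups


if __name__ == "__main__":
    # Hypothetical excerpt of a per-req node list with node4 listed twice.
    req_node_list = [("node3", 4), ("node4", 4), ("node4", 4), ("node5", 4)]
    print(find_duplicates(req_node_list))  # ['node4']
    print(coalesce(req_node_list))  # [('node3', 4), ('node4', 8), ('node5', 4)]
```

A check like find_duplicates on the list going into the allocation step would show whether the duplicate is already present in the feasible list or only appears afterwards.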
Regards,
Jan Ploski