[torquedev] Problems starting job with torque 2.1.6

Åke Sandgren ake.sandgren at hpc2n.umu.se
Fri Jan 12 11:32:09 MST 2007


On Fri, 2007-01-12 at 11:02 -0500, Troy Baer wrote:
> On Fri, 2007-01-12 at 16:15 +0100, Åke Sandgren wrote:
> > I just had a semi-large job (90 nodes) fail to start because the master
> > node did not send the JOIN_JOB to all sisters, or at least some sisters
> > did not receive it.
> > 
> > Anyone seen anything like this?
> 
> We've seen that a lot in OpenPBS, but not (yet) in TORQUE.  The failure
> mode in OpenPBS seems to be that the sister node has some degree of load
> on it and drops the JOIN_JOB message, and then the mother superior never
> tries to send another one.

I've never seen this before either, but we seldom have jobs as large as
this one, so I got curious as to why it didn't start.

It shouldn't matter how much load the sister node has; it simply
shouldn't drop such a message.

Garrick? Any ideas?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


