[torquedev] Problems starting job with torque 2.1.6

Troy Baer troy at osc.edu
Fri Jan 12 09:02:49 MST 2007


On Fri, 2007-01-12 at 16:15 +0100, Åke Sandgren wrote:
> I just had a semi-large job (90 nodes) fail to start because the master
> node did not send out the JOIN_JOB to all sisters, or at least some
> sisters did not receive it.
> 
> Anyone seen anything like this?

We've seen that a lot in OpenPBS, but not (yet) in TORQUE.  The failure
mode in OpenPBS seems to be that a sister node under some load drops
the JOIN_JOB message, and the mother superior then never tries to send
another one.
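
To make that failure mode concrete, here is a toy model of it.  This is
only a sketch, not OpenPBS or TORQUE code: start_job, drops_before_ack,
max_attempts and the node names are all made up for illustration, and the
real MOM-to-MOM protocol is more involved.

```python
def start_job(sisters, drops_before_ack, max_attempts):
    """Return the sisters that never acknowledged JOIN_JOB.

    drops_before_ack (hypothetical) maps a sister name to the number of
    JOIN_JOB messages it drops, e.g. under load, before it answers.
    max_attempts is how many times the mother superior sends to each sister.
    """
    missing = []
    for sister in sisters:
        # The sister answers on attempt number (drops + 1), if we get that far.
        attempts_needed = drops_before_ack.get(sister, 0) + 1
        if attempts_needed > max_attempts:
            missing.append(sister)
    return missing


sisters = ["node01", "node02", "node03"]   # hypothetical node names
loaded = {"node02": 1}                     # node02 drops a single message

# One send and no retry: the loaded sister never joins and the job cannot start.
print(start_job(sisters, loaded, max_attempts=1))   # ['node02']

# With a few resends, the same dropped message is harmless.
print(start_job(sisters, loaded, max_attempts=3))   # []
```

With max_attempts=1, a single dropped message is enough to keep the job
from starting, which matches what we see.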

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701



