[torquedev] Problems starting job with torque 2.1.6
troy at osc.edu
Fri Jan 12 09:02:49 MST 2007
On Fri, 2007-01-12 at 16:15 +0100, Åke Sandgren wrote:
> I just had a semi-large job (90 nodes) fail to start due to the master
> node not sending out the JOIN_JOB to all sisters, or at least the
> sisters not receiving it.
> Anyone seen anything like this?
We've seen that a lot in OpenPBS, but not (yet) in TORQUE. The failure
mode in OpenPBS seems to be that the sister node has some degree of load
on it and drops the JOIN_JOB message, and then the mother superior never
tries to send another one.
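
To make that failure mode concrete, here is a toy Python model of it. This
is only a sketch, not OpenPBS or TORQUE code: start_job, deliver, the node
names, and the max_attempts knob are all made up for illustration, and
whether a simple re-send fits the real MOM protocol is an assumption.

```python
def start_job(sisters, deliver, max_attempts=1):
    """Return the set of sisters that never acknowledged JOIN_JOB.

    deliver(sister, attempt) -> bool is a hypothetical stand-in for
    sending one JOIN_JOB request and waiting for the reply.
    """
    pending = set(sisters)
    for attempt in range(1, max_attempts + 1):
        # Re-send only to sisters that have not joined yet.
        pending = {s for s in pending if not deliver(s, attempt)}
        if not pending:
            break
    return pending


def loaded_node_drops_first(sister, attempt):
    # Toy model: node "n07" is under load and drops the first message only.
    return not (sister == "n07" and attempt == 1)


# 89 sisters plus the mother superior = 90 nodes, as in the report.
sisters = [f"n{i:02d}" for i in range(1, 90)]

# Single attempt, no retry: one sister never joins, so the job cannot start.
print(start_job(sisters, loaded_node_drops_first, max_attempts=1))  # {'n07'}

# With one re-send, the dropped message is recovered in this toy model.
print(start_job(sisters, loaded_node_drops_first, max_attempts=2))  # set()
```

With max_attempts=1 a single dropped message is enough to leave the job
stuck, which matches what we see; the second call only shows that a re-send
would cover a transient drop like this one.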
Troy Baer troy at osc.edu
Science & Technology Support http://www.osc.edu/hpc/
Ohio Supercomputer Center 614-292-9701