[torqueusers] requested job die, code 1099

Garrick Staples garrick at clusterresources.com
Mon Sep 18 11:39:53 MDT 2006


On Fri, Sep 15, 2006 at 09:52:56AM -0500, brad mecklenburg alleged:
> We have two cluster systems reporting this problem:  128 node IBM Open Power
> 5 cluster with torque 1.2.0.p4 and maui 3.2.6p9. I know these are older
> versions but will get to that in a minute.
> 
> I have had a problem with random compute nodes reporting the following error
> after the pbs job was run using qsub. Mpirun is also used.
> 
> Job: 566.marvin
> 07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin
> 07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099
> 
> The node reboots itself, crashing the PBS job.  Either the 1099 error
> causes the compute node to crash, or the 1099 error is a result of the
> crash; I am not sure which.  The compute nodes that report this error
> are random, however: no node has reported the problem more than once.
> 
> So an upgrade in PBS was in order.  The following is what we have installed
> on our Apple test cluster.
> MPICH - 1.2.7..1
> OSX - 10.4.7
> XSAN - 1.4
> PBS Torque - 2.1.2
> MX Driver - 1.1.4 w/ fma mapper
> MAUI - 3.2.6 p16
> Mpiexec - .81
> 
> We are currently testing this on our 128 Apple Xserve G5 test cluster before
> we implement the upgrade on our production system.
> 
> However, this morning I saw the same error as on our IBM cluster.  The error
> we see on this system is:
> PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -
> internal or network failure attempting to communicate with sister MOM's
> 
> The job crashed and the node was unresponsive. A manual hard reboot was in
> order. 
> It seems to be a PBS communications problem but the system logs are not
> really giving any good information.
> 
> Does anyone have any ideas/suggestions on why this is occurring and how to
> resolve the problem?  Has anyone else seen these same types of errors and
> had success resolving the problem? Any input anyone may have will be
> appreciated. Thanks.

The EOF errors would be an effect of the node lockups/reboots, not their
cause.  Any node lockup, crash, or reboot triggered by userspace software
is necessarily a kernel bug.
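
To check whether the affected nodes really are random, something like the
following rough sketch could tally "requested job die" reports per host.
The line format is assumed from the two messages quoted above, and the
function name count_job_die_reports is made up for illustration; it is not
part of TORQUE.

```python
import re
from collections import Counter

# Pattern assumed from the messages quoted in this thread, e.g.
#   node 42 (r05n18) requested job die, code 1099
#   node 42 (r01n46) requested job die, 'EOF' (code 1099)
DIE_RE = re.compile(
    r"node\s+(\d+)\s+\(([^)]+)\)\s+requested job die,.*?code\s+(\d+)"
)


def count_job_die_reports(lines, code="1099"):
    """Count 'requested job die' reports per hostname for one error code."""
    counts = Counter()
    for line in lines:
        match = DIE_RE.search(line)
        if match and match.group(3) == code:
            counts[match.group(2)] += 1
    return counts


# Sample input copied from the log lines quoted in this thread.
sample = [
    "07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin",
    "07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099",
    "PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -",
]

for host, reports in sorted(count_job_die_reports(sample).items()):
    print(host, reports)
```

A host that shows up repeatedly would point at that machine's hardware; an
even spread across hosts fits better with a kernel or driver problem common
to all nodes.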


