[torqueusers] requested job die, code 1099
Garrick Staples
garrick at clusterresources.com
Mon Sep 18 11:39:53 MDT 2006
On Fri, Sep 15, 2006 at 09:52:56AM -0500, brad mecklenburg alleged:
> We have two cluster systems reporting this problem. The first is a
> 128-node IBM Open Power 5 cluster with torque 1.2.0.p4 and maui 3.2.6p9.
> I know these are older versions, but I will get to that in a minute.
>
> Random compute nodes report the following error after a PBS job has been
> submitted with qsub. The jobs also use mpirun.
>
> Job: 566.marvin
> 07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin
> 07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099
>
> The node reboots itself, which crashes the PBS job. Either the 1099 error
> causes the compute node to crash, or the 1099 error is a result of the
> crash; I am not positive which. The compute nodes that receive this error
> are random, though: no compute node has reported the problem twice.
>
> So an upgrade in PBS was in order. The following is what we have installed
> on our Apple test cluster.
> MPICH - 1.2.7..1
> OSX - 10.4.7
> XSAN - 1.4
> PBS Torque - 2.1.2
> MX Driver - 1.1.4 w/ fma mapper
> MAUI - 3.2.6 p16
> Mpiexec - .81
>
> We are currently testing this on our 128-node Apple Xserve G5 test cluster
> before we implement the upgrade on our production system.
>
> However, this morning I saw the same error as on our IBM cluster. The error
> we see on this system is:
> PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -
> internal or network failure attempting to communicate with sister MOM's
>
> The job crashed and the node was unresponsive. A manual hard reboot was in
> order.
> It seems to be a PBS communications problem but the system logs are not
> really giving any good information.
>
> Does anyone have any ideas/suggestions on why this is occurring and how to
> resolve the problem? Has anyone else seen these same types of errors and
> had success resolving the problem? Any input anyone may have will be
> appreciated. Thanks.
EOF errors would be the effect of the node lockups/reboots. Any node
lockups, crashes, or reboots triggered by userspace software are
necessarily kernel bugs.
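
To check whether the failures really are spread across random nodes, the
"requested job die" messages can be tallied per host. The following is only
a rough sketch: the regular expression is based solely on the two message
forms quoted above, the function name tally_job_die is made up for
illustration, and the sample lines are copied from this thread rather than
read from real log files.

```python
import re
from collections import Counter

# Assumption: this pattern covers only the two message forms quoted in this
# thread; other TORQUE versions may word the message differently.
DIE_RE = re.compile(r"node (\d+) \(([^)]+)\) requested job die,.*?code (\d+)")


def tally_job_die(lines):
    """Count 'requested job die' messages per (hostname, error code)."""
    counts = Counter()
    for line in lines:
        match = DIE_RE.search(line)
        if match:
            _node_index, host, code = match.groups()
            counts[(host, int(code))] += 1
    return counts


# Sample lines copied from the messages quoted above.
SAMPLE = [
    "07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin",
    "07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099",
    "PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -",
]

if __name__ == "__main__":
    for (host, code), n in sorted(tally_job_die(SAMPLE).items()):
        print(f"{host}\tcode {code}\t{n}")
```

A host that shows up repeatedly would point at that machine's hardware or
kernel, while an even spread is more consistent with a bug common to all
nodes.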