[torqueusers] requested job die, code 1099

brad mecklenburg bmecklenburg at colsa.com
Fri Sep 15 08:52:56 MDT 2006


We have two cluster systems reporting this problem. The first is a 128-node
IBM Open Power 5 cluster with torque 1.2.0.p4 and maui 3.2.6p9. I know these
are older versions, but I will get to that in a minute.

Random compute nodes report the following error after a PBS job is submitted
with qsub. The jobs also use mpirun.

Job: 566.marvin
07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin
07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099

The node reboots itself, which crashes the PBS job.  Either the 1099 error
causes the compute node to crash or it is a result of the crash; I am not
sure which.  The compute nodes that report this error are random, though:
no single compute node has reported the problem more than once.
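
A minimal sketch of how the "random nodes" observation could be checked:
it tallies "requested job die" events per host from job log lines. The
regular expression is an assumption based only on the two message formats
quoted in this post, and the names tally_job_die and DIE_RE are
hypothetical helpers, not part of TORQUE.

```python
import re
from collections import Counter

# Assumed line formats (taken from the messages quoted in this post):
#   07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099
#   PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -
DIE_RE = re.compile(
    r"node (?P<index>\d+) \((?P<host>[^)]+)\) requested job die,"
    r".*?code (?P<code>\d+)"
)


def tally_job_die(lines, code="1099"):
    """Count 'requested job die' events per host for the given error code."""
    counts = Counter()
    for line in lines:
        match = DIE_RE.search(line)
        if match and match.group("code") == code:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "07/19/2006 14:15:15 M Job Modified at request of PBS_Server at marvin",
        "07/19/2006 14:26:56 M node 42 (r05n18) requested job die, code 1099",
        "PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -",
    ]
    # Print hosts ordered by how often they reported the error.
    for host, count in tally_job_die(sample).most_common():
        print(host, count)
```

If one host shows up repeatedly across many jobs, that points toward a
hardware or cabling fault on that node rather than a general PBS issue.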

So an upgrade in PBS was in order.  The following is what we have installed
on our Apple test cluster.
MPICH - 1.2.7..1
OSX - 10.4.7
XSAN - 1.4
PBS Torque - 2.1.2
MX Driver - 1.1.4 w/ fma mapper
MAUI - 3.2.6 p16
Mpiexec - 0.81

We are currently testing this on our 128-node Apple Xserve G5 test cluster
before we implement the upgrade on our production system.

However, this morning I saw the same error as on our IBM cluster.  The error
we see on this system is:
PBS: job killed: node 42 (r01n46) requested job die, 'EOF' (code 1099) -
internal or network failure attempting to communicate with sister MOM's

The job crashed and the node became unresponsive, so a manual hard reboot
was required.  It looks like a PBS communication problem, but the system
logs are not giving any useful information.
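
A rough sketch of a basic reachability check between nodes, in case it
helps narrow down a communication problem. It only tests whether a TCP
connection can be opened at that moment, so it will not catch transient
failures. The ports 15002 and 15003 are assumed to be the TORQUE pbs_mom
defaults and should be adjusted if the site overrides them; port_open and
check_nodes are hypothetical helper names.

```python
import socket

# Assumption: default pbs_mom service and resource-manager ports.
MOM_PORTS = (15002, 15003)


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_nodes(hosts, ports=MOM_PORTS, timeout=2.0):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in hosts
        for port in ports
    }


if __name__ == "__main__":
    # Host name taken from the error message above; replace with a node list.
    for (host, port), ok in check_nodes(["r01n46"]).items():
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Running this from several compute nodes against their sisters, ideally in
a loop while a job is active, might show whether particular nodes or links
drop out.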

Does anyone have any ideas/suggestions on why this is occurring and how to
resolve the problem?  Has anyone else seen these same types of errors and
had success resolving the problem? Any input anyone may have will be
appreciated. Thanks.

Brad

