[torqueusers] Job dies randomly, but only through torque

Jan Ploski Jan.Ploski at offis.de
Tue May 27 10:25:59 MDT 2008

torqueusers-bounces at supercluster.org wrote on 05/27/2008 06:14:30 PM:

> Hi all:
> I've got a problem with a user's MPI job.  This code is in use on
> dozens of clusters around the world, but for some reason, when run on
> my Rocks 4.3 cluster, it dies at random timesteps.  The logs are quite
> unhelpful:
> [root at aeolus logs]# more 2047.aeolus.OU
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is  /mnt/pvfs2/patton/data/chem/aa1
> exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is  /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
> Terminated
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed).  Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> [compute-0-0.local:03444] OOB: Connection to HNP lost
> We've been trying to figure out what's going on....We've tried
> different datasets, different nodes, different numbers of processors.
> We started on OpenMPI 1.2.4 and upgraded to 1.2.6, with no change.
> We've connected the compute node to the head node directly (bypassing
> the switch, etc.) with no change.  It doesn't matter where the data is
> stored...  If we run with nodes=1 (single threaded, single cpu), then
> it runs through to completion.
> The only clue we've found happened this morning:  If we run the job
> directly with mpirun (torque has no knowledge), it runs fine.  But
> submit it through torque+maui, and it dies as above.
> I'm at a loss at this point as to how to troubleshoot this further.
> Is there a way to get more details on torque about this?  Turn up
> logging?  Any known issues that might affect this?  I have about a
> dozen users running on the cluster, all using the scheduler, about
> half of which are MPI (and some are using nearly the entire cluster on
> a run), all without any such problems.  Any suggestions?

This suggestion is rather trivial, but since you have not mentioned 
anything in this area:

Are you sure that the job is not exceeding resource limits (walltime - 
enforced by TORQUE, or rlimits such as memory - enforced by the kernel, 
but they could be set differently in TORQUE and your manual invocations of 
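
One way to check the rlimit side of this is to print the limits from inside the job and again from the shell where mpirun works by hand, then compare the two outputs. A minimal sketch in Python, using only the standard `resource` module; the helper names `fmt` and `report` and the selection of limits are my own choices for illustration, not anything TORQUE provides:

```python
import resource

# Limits that commonly differ between a batch job and a login shell.
# The selection is an assumption; extend the list as needed.
LIMITS = [
    "RLIMIT_AS",
    "RLIMIT_DATA",
    "RLIMIT_STACK",
    "RLIMIT_CPU",
    "RLIMIT_MEMLOCK",
    "RLIMIT_NOFILE",
]


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Return one line per limit with its soft and hard value."""
    lines = []
    for name in LIMITS:
        const = getattr(resource, name, None)
        if const is None:
            # Not every platform defines every limit.
            continue
        soft, hard = resource.getrlimit(const)
        lines.append("{0} soft={1} hard={2}".format(name, fmt(soft), fmt(hard)))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run it once as the first command of the job script submitted through TORQUE and once from the interactive shell on the same compute node; any line that differs between the two outputs is a candidate explanation. This only covers the kernel rlimits, not the walltime limit enforced by TORQUE itself.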

Jan Ploski
