[torqueusers] Job dies randomly, but only through torque
Jan Ploski
Jan.Ploski at offis.de
Tue May 27 10:25:59 MDT 2008
torqueusers-bounces at supercluster.org wrote on 05/27/2008 06:14:30 PM:
> Hi all:
>
> I've got a problem with a user's MPI job. This code is in use on
> dozens of clusters around the world, but for some reason, when run on
> my Rocks 4.3 cluster, it dies at random timesteps. The logs are quite
> unhelpful:
>
> [root at aeolus logs]# more 2047.aeolus.OU
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
>
> Terminated
>
> --------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
>
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed). Hit control-C again within 1
> second if you really want to kill mpirun immediately.
>
> --------------------------------------------------------------------------
> [compute-0-0.local:03444] OOB: Connection to HNP lost
>
> We've been trying to figure out what's going on. We've tried
> different datasets, different nodes, different numbers of processors.
> We started on OpenMPI 1.2.4 and upgraded to 1.2.6, with no change.
> We've connected the compute node to the head node directly (bypassing
> the switch, etc.) with no change. It doesn't matter where the data is
> stored... If we run with nodes=1 (single threaded, single cpu), then
> it runs through to completion.
>
> The only clue we've found happened this morning: If we run the job
> directly with mpirun (torque has no knowledge), it runs fine. But
> submit it through torque+maui, and it dies as above.
>
> I'm at a loss at this point as to how to troubleshoot this further.
> Is there a way to get more details from torque about this? Turn up
> logging? Any known issues that might affect this? I have about a
> dozen users running on the cluster, all using the scheduler, about
> half of which are MPI (and some are using nearly the entire cluster on
> a run), all without any such problems. Any suggestions?
This suggestion is rather trivial, but since you have not mentioned
anything in this area:
Are you sure that the job is not exceeding its resource limits? Walltime
is enforced by TORQUE, while rlimits such as memory are enforced by the
kernel, and the rlimits may be set differently under TORQUE than in your
manual invocations of mpirun.
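One way to check the rlimit part of this is to print the limits that a process actually sees, once from the shell where mpirun works and once from inside a job submitted through TORQUE, and then compare the two outputs. The following is only a minimal sketch, assuming Python is available on the compute nodes; the selection of limits and the helper names (format_limit, current_limits) are illustrative choices, not part of TORQUE or Open MPI.

```python
import resource

# Limits that commonly differ between an interactive shell and a batch job.
# The selection is an assumption; extend it as needed for your site.
LIMITS = {
    "cpu time (s)": resource.RLIMIT_CPU,
    "data segment (bytes)": resource.RLIMIT_DATA,
    "stack (bytes)": resource.RLIMIT_STACK,
    "address space (bytes)": resource.RLIMIT_AS,
    "open files": resource.RLIMIT_NOFILE,
    "core file (bytes)": resource.RLIMIT_CORE,
}


def format_limit(value):
    """Render a raw rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits of this process."""
    result = {}
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        result[label] = (format_limit(soft), format_limit(hard))
    return result


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label:<22} soft={soft:>12} hard={hard:>12}")
```

Running the script both ways and diffing the two outputs should show whether the batch environment imposes tighter limits than the interactive one. It does not cover walltime, which TORQUE enforces itself; the limits requested for a job should be visible in the output of `qstat -f` for that job.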
Regards,
Jan Ploski