[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine
garrick at usc.edu
Mon Dec 5 13:01:18 MST 2005
On Wed, Nov 30, 2005 at 10:50:42AM -0500, Prakash Velayutham alleged:
> Hi All,
> I have a cluster with 18 nodes. I have Torque-1.2.0p5 with MPICH-1.2.7
> and mpiexec-0.8 running.
> When I run an MPI application, after a random amount of time, I see that
> the job gets killed.
> The error given is "p0_7360: p4_error: interrupt SIGx: 15"
> When I replace Torque with OpenPBS-2.3.16, with everything else remaining
> the same, the job completes just fine. Of course, I recompiled
> mpiexec to use OpenPBS.
> Any thoughts please?
Sig 15 is SIGTERM, which usually means the job went over some sort of limit.
There should be a corresponding log entry in the MOM log and an email to
the user. Also check your scheduler log.
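
As a rough illustration of the two checks above, here is a small Python
sketch. It confirms that signal number 15 maps to SIGTERM on the local
platform and pulls out the lines of a log file that mention a given job ID.
The helper names, the job ID, and the log file path are all placeholders
invented for this example; where the MOM and scheduler logs live depends on
how Torque was installed, and the match is a plain substring search that
assumes nothing about the log format.

```python
import signal
import sys


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


def lines_mentioning(job_id: str, log_lines):
    """Return the log lines that contain job_id (plain substring match)."""
    return [line.rstrip("\n") for line in log_lines if job_id in line]


if __name__ == "__main__":
    # Signal 15 from the p4_error message.
    print(signal_name(15))

    # Optional: script.py <job_id> <path_to_log_file>
    # Both arguments are supplied by the user; nothing is assumed about
    # where the MOM or scheduler log is stored.
    if len(sys.argv) == 3:
        job_id, log_path = sys.argv[1], sys.argv[2]
        with open(log_path, errors="replace") as handle:
            for line in lines_mentioning(job_id, handle):
                print(line)
```

Run it once against the MOM log on the node where the job died and once
against the scheduler log, then look at the entries just before the time
the job was killed.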
Garrick Staples, Linux/HPCC Administrator
University of Southern California