[torqueusers] Torque with MPICH kills jobs consistently, but OpenPBS works fine

Garrick Staples garrick at usc.edu
Mon Dec 5 13:01:18 MST 2005


On Wed, Nov 30, 2005 at 10:50:42AM -0500, Prakash Velayutham alleged:
> Hi All,
> 
> I have a cluster with 18 nodes. I have Torque-1.2.0p5 with MPICH-1.2.7 
> and mpiexec-0.8 running.
> When I run an MPI application, after a random amount of time, I see that 
> the job gets killed.
> The error given is "p0_7360:  p4_error: interrupt SIGx: 15"
> 
> When I replace Torque with OpenPBS-2.3.16, with everything else remaining 
> the same, the job completes just fine. Of course, I recompiled 
> mpiexec to use OpenPBS.
> 
> Any thoughts please?

Signal 15 is SIGTERM, which usually means the job is over some sort of limit.
There should be a corresponding log entry in the MOM log and an email to
the user.  Also check your scheduler log.
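
A rough sketch of that check in Python, in case it helps: it confirms what
signal 15 is called on the local platform and pulls the lines mentioning a
given job out of a log file. The helper names (signal_name, find_job_lines)
are made up for illustration, and the log path and job id in the usage note
are assumptions, not values from this thread.

```python
import signal
from pathlib import Path


def signal_name(signum):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(signum).name


def find_job_lines(log_path, job_id):
    """Return every line of the log file that mentions job_id.

    This is a plain substring match; it does not parse the log format.
    """
    text = Path(log_path).read_text(errors="replace")
    return [line for line in text.splitlines() if job_id in line]


if __name__ == "__main__":
    # On Linux this prints the name of signal 15.
    print(signal_name(15))
```

To use it, call something like
find_job_lines("/var/spool/torque/mom_logs/20051130", "1234.headnode") on the
node that ran the job. Both arguments are hypothetical; substitute the MOM log
file and job id from your own installation.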

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California