[torqueusers] p4_error: interrupt SIGx: 15

Gus Correa gus at ldeo.columbia.edu
Fri Feb 20 10:24:37 MST 2009


Hi Samir, list

This is not likely to be a Torque problem,
but rather one generated by the old MPICH-1 p4 device
and the way it works (or fails to work) with newer kernels.

The easy fix is to upgrade to MPICH2
with the Nemesis communication channel,
or to OpenMPI.
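
For what it's worth, a rough sketch of building MPICH2 with the Nemesis
channel would look something like the lines below; the version number
and install prefix are just placeholders, adjust them to your site:

  tar xzf mpich2-1.0.8.tar.gz
  cd mpich2-1.0.8
  ./configure --prefix=/share/apps/mpich2 --with-device=ch3:nemesis
  make
  make install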
There was a long discussion about p4 errors from
MPICH 1.2.7p1 in Rocks 5.1 not so long ago.
Follow this thread backwards to see the details:

http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
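
If you prefer OpenMPI instead, building it against Torque's TM library
lets mpirun take the node list straight from the job, with no machine
file needed. Again, the install prefix and Torque path below are only
placeholders for illustration:

  ./configure --prefix=/share/apps/openmpi --with-tm=/opt/torque
  make all install
  # in the job script, mpirun then launches one process per slot
  # granted by Torque:
  # /share/apps/openmpi/bin/mpirun ./Torus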

I hope this helps,
Gus Correa

Samir Khanal wrote:
> Hi All
> I am getting this
> "p4_error: interrupt SIGx: 15" error every time I use
> 
> the following PBS job submission. I am using Torque/Maui with Rocks 5.1 and MPICH 1.2.7p1:
> #PBS -l walltime=1:00:00
> #PBS -N my_job
> #PBS -j oe
> #PBS -l nodes=2
> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
> time /home/skhanal/mpiexec/bin/mpiexec -verbose -mpich-p4-no-shmem ./Torus
> 
> -------------------------------
> compute-0-5.local
> compute-0-5 compute-0-5
> mpiexec: resolve_exe: using absolute path "./Torus".
> node  0: name compute-0-5, cpu avail 2
> mpiexec: process_start_event: evt 2 task 0 on compute-0-5.
> mpiexec: read_p4_master_port: waiting for port from master.
> mpiexec: read_p4_master_port: got port 38143.
> mpiexec: process_start_event: evt 4 task 1 on compute-0-5.
> mpiexec: All 2 tasks (spawn 0) started.
> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
> mpiexec: killall: caught signal 15 (Terminated).
> mpiexec: kill_tasks: killing all tasks.
> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
> p0_8747:  p4_error: interrupt SIGx: 15
> p1_8749:  p4_error: interrupt SIGx: 13
> p1_8749: (210.218750) net_send: could not write to fd=5, errno = 32
> bm_list_8748:  p4_error: interrupt SIGx: 15
> mpiexec: killall: caught signal 15 (Terminated).
> -----------------------------
> 
> I have seen the same problem reported by many users, but no suggested solutions.
> Is there a way to fix this? I am spending a lot of time on it, please help.
> 
> Thank you 
> Samir


