FW: [torqueusers] p4_error: interrupt SIGx: 15

Gus Correa gus at ldeo.columbia.edu
Fri Feb 20 17:18:16 MST 2009


Hi Samir

I am confused by the apparent mismatch between the
number of CPUs/cores you request (24):

#PBS -l nodes=6:ppn=4

and the number of processes you launch with mpiexec (2):

time mpiexec -n  2 ./Torus

Can your Torus code run on 2 processors, does it require
all 24, or is the number of processors immaterial?
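
If the intent is to use every slot Torque allocated, one way to keep
the mpiexec count in sync with the request is to take it from
$PBS_NODEFILE. A minimal sketch, assuming the file lists one line per
allocated slot (which is how Torque writes it):

   NPROCS=$(wc -l < $PBS_NODEFILE)
   time mpiexec -n $NPROCS ./Torus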

Anyway, this is now more of an MPI/MPICH issue than a Torque issue.
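
One thing worth checking on the MPI side: make sure the mpicxx you
compiled Torus with and the mpiexec/mpdboot you launch it with come
from the same MPICH2 installation. An "Invalid communicator" right at
MPI_Comm_size is often a sign of a binary built against one MPI's
headers but run under another's runtime. A quick sanity check
(mpicxx -show should work with the MPICH wrappers):

   which mpicxx
   which mpiexec
   mpicxx -show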

Gus Correa

Samir Khanal wrote:
> Hi Gus
> I ran into another problem.
> 
> I am able to run the examples with
> mpiexec -n 8 hostname
> 
> [skhanal at comet ~]$ mpicxx -L/home/skhanal/bgtw/lib -lbgtw bgtwTorusTest.cpp -o Torus
> but when I submitted a program to PBS with this script
> -------------------
> #PBS -l walltime=1:00:00
> #PBS -N my_job
> #PBS -j oe
> #PBS -l nodes=6:ppn=4
> echo `hostname`
> echo Directory is `pwd`
> echo This job is running on following Processors
> echo `cat $PBS_NODEFILE`
> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
> export PATH=/home/skhanal/mpich2/bin:$PATH
> mpdboot -f mpd.hosts -n 7
> time mpiexec -n  2 ./Torus
> ---------------
> I got the following output:
> 
> --------------
> compute-0-5.local
> Directory is /home/skhanal
> This job is running on following Processors
> compute-0-5 compute-0-5 compute-0-5 compute-0-5 compute-0-4 compute-0-4 compute-0-4 compute-0-4 compute-0-3 compute-0-3 compute-0-3 compute-0-3 compute-0-2 compute-0-2 compute-0-2 compute-0-2 compute-0-1 compute-0-1 compute-0-1 compute-0-1 compute-0-0 compute-0-0 compute-0-0 compute-0-0
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fff3ba06a7c) failed
> MPI_Comm_size(70).: Invalid communicator
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fffebaa0ccc) failed
> MPI_Comm_size(70).: Invalid communicator
> rank 1 in job 1  compute-0-5.local_48777   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> ---------------
> My mpd.hosts output looks like
> 
> compute-0-0:4
> compute-0-1:4
> compute-0-2:4
> compute-0-3:4
> compute-0-4:4
> compute-0-5:4
> comet.cs.bgsu.edu:8
> ~
> When I do mpiexec -n 1 ./Torus, it works.
> 
> 
> Thanks
> Samir
> ________________________________________
> From: Gus Correa [gus at ldeo.columbia.edu]
> Sent: Friday, February 20, 2009 1:12 PM
> To: Samir Khanal
> Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15
> 
> Hi Samir
> 
> You are welcome.
> 
> Not really my business, but you may want to experiment
> with OpenMPI as well.
> It is very easy to build (great README file, excellent FAQ).
> A single build can include support for Torque or SGE,
> for Ethernet (TCP/IP), Myrinet (GM or MX), and Infiniband,
> and it is a lot more flexible than MPICH2 in many ways.
> The runtime environment, for instance, is much richer
> than MPICH2's.
> 
> Rocks ships OpenMPI by default, but it is built only with gcc
> (maybe gfortran in Rocks 5.1, but I have Rocks 4.3).
> However, I built OpenMPI from scratch to get Torque and
> Fortran support (separate builds for PGI and Intel).
> It took no time at all.
> 
> OpenMPI+Torque gives you much finer control over jobs.
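> 
> If you build it yourself, Torque support is just a configure flag.
> Roughly (the paths below are only examples; point --with-tm at your
> own Torque install):
> 
>    ./configure --with-tm=/opt/torque --prefix=/opt/openmpi
>    make all install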
> 
> The climate and ocean models that we run here compiled
> and ran with OpenMPI without any problem (on GigE).
> We continue to use MPICH2 as well, and I think it is very good
> to have more than one free MPI implementation to choose from.
> 
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
> 
> Samir Khanal wrote:
>> Hi Gus
>> Yes, it did solve the problem...
>> I built mpich2 with the nemesis option.
>> Thank you
>> Samir
>> ________________________________________
>> From: Gus Correa [gus at ldeo.columbia.edu]
>> Sent: Friday, February 20, 2009 12:24 PM
>> To: Torque Users
>> Cc: Samir Khanal
>> Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15
>>
>> Hi Samir, list
>>
>> This is not likely to be a Torque problem,
>> but rather one caused by the old MPICH-1 and the way it
>> works (or doesn't) with newer kernels.
>>
>> The easy fix is to upgrade to MPICH2
>> with the Nemesis communication channel,
>> or to OpenMPI.
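>>
>> For reference, selecting the Nemesis channel is done at configure
>> time; roughly (the prefix is only an example):
>>
>>    ./configure --with-device=ch3:nemesis --prefix=/opt/mpich2
>>    make && make install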
>> There was a long discussion about p4 errors from
>> MPICH 1.2.7p1 in Rocks 5.1 not so long ago.
>> Follow this thread backwards to see the details:
>>
>> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>>
>> I hope this helps,
>> Gus Correa
>>
>> Samir Khanal wrote:
>>> Hi All
>>> I am getting this "p4_error: interrupt SIGx: 15" error every time I use
>>> the following PBS job submission. I am using Torque/Maui with
>>> Rocks 5.1 & MPICH 1.2.7p1:
>>> #PBS -l walltime=1:00:00
>>> #PBS -N my_job
>>> #PBS -j oe
>>> #PBS -l nodes=2
>>> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
>>> time /home/skhanal/mpiexec/bin/mpiexec -verbose -mpich-p4-no-shmem ./Torus
>>>
>>> -------------------------------
>>> compute-0-5.local
>>> compute-0-5 compute-0-5
>>> mpiexec: resolve_exe: using absolute path "./Torus".
>>> node  0: name compute-0-5, cpu avail 2
>>> mpiexec: process_start_event: evt 2 task 0 on compute-0-5.
>>> mpiexec: read_p4_master_port: waiting for port from master.
>>> mpiexec: read_p4_master_port: got port 38143.
>>> mpiexec: process_start_event: evt 4 task 1 on compute-0-5.
>>> mpiexec: All 2 tasks (spawn 0) started.
>>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>>> mpiexec: killall: caught signal 15 (Terminated).
>>> mpiexec: kill_tasks: killing all tasks.
>>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>>> p0_8747:  p4_error: interrupt SIGx: 15
>>> p1_8749:  p4_error: interrupt SIGx: 13
>>> p1_8749: (210.218750) net_send: could not write to fd=5, errno = 32
>>> bm_list_8748:  p4_error: interrupt SIGx: 15
>>> mpiexec: killall: caught signal 15 (Terminated).
>>> -----------------------------
>>>
>>> I have seen many users report the same problem, but no suggested solutions.
>>> Is there a way to fix this? I am spending a lot of time on it; please help.
>>>
>>> Thank you
>>> Samir
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


