FW: [torqueusers] p4_error: interrupt SIGx: 15

Gus Correa gus at ldeo.columbia.edu
Mon Feb 23 08:42:01 MST 2009


Hi Samir

1) First and foremost, by all means,
please subscribe to the MPICH2 list
if you are not registered there yet.

You need to ask these questions on the MPICH mailing list.
This is definitely not a Torque issue.

You will get more attention and assistance on the MPICH list,
where other people besides me can help you better.
We are exchanging messages on the wrong list, with the wrong
audience.

2) Second, if the MPICH examples run correctly,
and the "Torus" code doesn't,
then the problem is in the "Torus" code.
It may have a programming bug.

Again, the MPICH list can help you with that,
not the Torque list.
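
A quick way to tell is to compile and run one of the small test
codes that ship with MPICH2, using the same wrappers and launcher
as for "Torus". A sketch (the examples/ path assumes you still
have the MPICH2 source tree around):

   mpicc examples/cpi.c -o cpi
   mpiexec -n 8 ./cpi

If cpi runs fine on the same nodes where "Torus" fails, the
environment is probably healthy, and the code is the suspect.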

3) Third, it doesn't make sense to mix MPICH-1 and MPICH-2.
You are shooting yourself in the foot if you mix different
MPI implementations at compile time and at runtime.
The compiler wrappers and the mpiexec launcher must come from
the same MPI implementation.
Likewise, don't mix 32-bit and 64-bit libraries and executables;
it won't work in most cases, and the confusion may be very hard
to debug.
Don't mix compilers either.
For instance,
using a gcc-compiled MPICH-2 library when you compile the main
code with the PGI or Intel C compiler invites trouble and deep
frustration.
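
A few quick consistency checks (a sketch, assuming the MPICH2
wrappers are the ones on your PATH):

   which mpicxx mpiexec   # both should resolve to the same MPICH2 prefix
   mpicxx -show           # prints the underlying compiler and -I/-L paths
   file ./Torus           # should report "ELF 64-bit" on an x86_64 cluster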

My two cents.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Samir Khanal wrote:
> Hi Gus,
>
> Does it make sense to compile with mpich-1.2.7 and execute using
> mpich2's mpiexec?
>
> My program runs well (gets compiled and submitted) with mpiexec
> (OSC) 0.75, MPICH 1.2.5, GCC 4.1.1, and Torque 1.0.1p5 on x86 Gentoo.
> I am trying to port it to a 64-bit cluster with GCC 4.1.2, mpiexec
> (OSC) 0.83, MPICH2 (I also tried MPICH 1.2.7 and Open MPI), and
> Torque 2.3.6.
>
> Are there any obvious changes required, or a best combination on
> the new system?
> Basically I am able to compile but not execute my code. I have spent
> about 3 hours on this without a clue and tried all the combinations;
> the mpich-1.2.7 and mpich2 mpiexec versions work, but only up to
> about 6-8 processes; beyond that there are all sorts of p4 errors.
>
> I am totally frustrated with all this.
>
> Thanks
> Samir
>


Samir Khanal wrote:
> Hi Gus
> The code does not run even on two processors.
> I tried 
> 
> mpiexec -n 2 ./Torus and it does not execute.
> With -n 1 (single processor) it works.
> 
> I am confused as to what to do.
> :-(
> 
> Samir
> 
> 
> ________________________________________
> From: Gus Correa [gus at ldeo.columbia.edu]
> Sent: Friday, February 20, 2009 7:18 PM
> To: Torque Users
> Cc: Samir Khanal
> Subject: Re: FW: [torqueusers] p4_error: interrupt SIGx: 15
> 
> Hi Samir
> 
> I am confused by the apparent mismatch between the
> number of CPUs/cores you request (24):
> 
> #PBS -l nodes=6:ppn=4
> 
> and the number of processes you launch with mpiexec (2):
> 
> time mpiexec -n  2 ./Torus
> 
> Is your Torus code able to run on 2 processors,
> or does it require 24, or perhaps the number of processors is
> immaterial?
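> 
> If the idea is one MPI process per slot that Torque allocated,
> a common pattern (just a sketch) is to size the launch from the
> node file:
> 
>    NP=$(wc -l < $PBS_NODEFILE)
>    time mpiexec -n $NP ./Torus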
> 
> Anyway, this is now more of an MPI/MPICH issue than a Torque issue.
> 
> Gus Correa
> 
> Samir Khanal wrote:
>> Hi Gus
>> I ran into another problem.
>>
>> I am able to run the examples with
>> mpiexec -n 8 hostname
>>
>> [skhanal at comet ~]$ mpicxx -L/home/skhanal/bgtw/lib -lbgtw bgtwTorusTest.cpp -o Torus
>> but when I submitted the program to PBS with this script:
>> -------------------
>> #PBS -l walltime=1:00:00
>> #PBS -N my_job
>> #PBS -j oe
>> #PBS -l nodes=6:ppn=4
>> echo `hostname`
>> echo Directory is `pwd`
>> echo This job is running on following Processors
>> echo `cat $PBS_NODEFILE`
>> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
>> export PATH=/home/skhanal/mpich2/bin:$PATH
>> mpdboot -f mpd.hosts -n 7
>> time mpiexec -n  2 ./Torus
>> ---------------
>> I got the following output:
>>
>> --------------
>> compute-0-5.local
>> Directory is /home/skhanal
>> This job is running on following Processors
>> compute-0-5 compute-0-5 compute-0-5 compute-0-5 compute-0-4 compute-0-4 compute-0-4 compute-0-4 compute-0-3 compute-0-3 compute-0-3 compute-0-3 compute-0-2 compute-0-2 compute-0-2 compute-0-2 compute-0-1 compute-0-1 compute-0-1 compute-0-1 compute-0-0 compute-0-0 compute-0-0 compute-0-0
>> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
>> MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fff3ba06a7c) failed
>> MPI_Comm_size(70).: Invalid communicator
>> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
>> MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fffebaa0ccc) failed
>> MPI_Comm_size(70).: Invalid communicator
>> rank 1 in job 1  compute-0-5.local_48777   caused collective abort of all ranks
>>   exit status of rank 1: killed by signal 9
>> ---------------
>> My mpd.hosts file looks like this:
>>
>> compute-0-0:4
>> compute-0-1:4
>> compute-0-2:4
>> compute-0-3:4
>> compute-0-4:4
>> compute-0-5:4
>> comet.cs.bgsu.edu:8
>> When I do mpiexec -n 1 ./Torus, it works.
>>
>>
>> Thanks
>> Samir
>> ________________________________________
>> From: Gus Correa [gus at ldeo.columbia.edu]
>> Sent: Friday, February 20, 2009 1:12 PM
>> To: Samir Khanal
>> Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15
>>
>> Hi Samir
>>
>> You are welcome.
>>
>> Not really my business, but you may consider
>> experimenting with OpenMPI also.
>> It is very easy to build (great README file, excellent FAQ).
>> It can be built with support
>> for Torque or SGE, for Ethernet (TCP/IP), Myrinet (GM or MX),
>> and Infiniband on the same build,
>> and is a lot more flexible than MPICH2 in
>> many ways.
>> The runtime environment,
>> for instance, is much richer than MPICH2's.
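>>
>> A sketch of such a Torque-aware build (just an illustration; the
>> Torque location and install prefix are guesses, adjust to your site):
>>
>>    ./configure --with-tm=/opt/torque --prefix=/home/skhanal/openmpi
>>    make all install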
>>
>> Rocks has OpenMPI as a default, but it is built only with gcc
>> (maybe gfortran in Rocks 5.1, but I have Rocks 4.3).
>> However, I built OpenMPI from scratch to get Torque and
>> Fortran support (different builds for PGI and Intel).
>> It took no time to do it.
>>
>> OpenMPI+Torque gives you much finer control over jobs.
>>
>> The climate and ocean models that we run here compiled
>> and ran with OpenMPI without any problem (on GigE).
>> We continue to use MPICH2 as well; I think it is very good
>> to have more than one free MPI implementation available.
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> Samir Khanal wrote:
>>> Hi Gus
>>> Yes it did solve the problem...
>>> I built mpich2 with nemesis option.
>>> Thank you
>>> Samir
>>> ________________________________________
>>> From: Gus Correa [gus at ldeo.columbia.edu]
>>> Sent: Friday, February 20, 2009 12:24 PM
>>> To: Torque Users
>>> Cc: Samir Khanal
>>> Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15
>>>
>>> Hi Samir, list
>>>
>>> This is not likely to be a Torque problem,
>>> but rather one generated by the old MPICH-1 and how it
>>> works (or doesn't) with newer kernels.
>>>
>>> The easy fix is to upgrade to MPICH2
>>> with the Nemesis communication channel,
>>> or to OpenMPI.
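>>>
>>> For MPICH2, the Nemesis channel is selected at configure time,
>>> roughly like this (a sketch; pick your own install prefix):
>>>
>>>    ./configure --with-device=ch3:nemesis --prefix=/home/skhanal/mpich2
>>>    make && make install
>>>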
>>> There was a long discussion about p4 errors from
>>> MPICH 1.2.7p1 in Rocks 5.1 not so long ago.
>>> Follow this thread backwards to see the details:
>>>
>>> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
>>> Samir Khanal wrote:
>>>> Hi All
>>>> I am getting this
>>>> p4_error: interrupt SIGx: 15 error every time I use
>>>> the following PBS job submission. I am using Torque/Maui with
>>>> Rocks 5.1 & MPICH 1.2.7p1:
>>>> #PBS -l walltime=1:00:00
>>>> #PBS -N my_job
>>>> #PBS -j oe
>>>> #PBS -l nodes=2
>>>> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
>>>> time /home/skhanal/mpiexec/bin/mpiexec -verbose -mpich-p4-no-shmem ./Torus
>>>>
>>>> -------------------------------
>>>> compute-0-5.local
>>>> compute-0-5 compute-0-5
>>>> mpiexec: resolve_exe: using absolute path "./Torus".
>>>> node  0: name compute-0-5, cpu avail 2
>>>> mpiexec: process_start_event: evt 2 task 0 on compute-0-5.
>>>> mpiexec: read_p4_master_port: waiting for port from master.
>>>> mpiexec: read_p4_master_port: got port 38143.
>>>> mpiexec: process_start_event: evt 4 task 1 on compute-0-5.
>>>> mpiexec: All 2 tasks (spawn 0) started.
>>>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>>>> mpiexec: killall: caught signal 15 (Terminated).
>>>> mpiexec: kill_tasks: killing all tasks.
>>>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>>>> p0_8747:  p4_error: interrupt SIGx: 15
>>>> p1_8749:  p4_error: interrupt SIGx: 13
>>>> p1_8749: (210.218750) net_send: could not write to fd=5, errno = 32
>>>> bm_list_8748:  p4_error: interrupt SIGx: 15
>>>> mpiexec: killall: caught signal 15 (Terminated).
>>>> -----------------------------
>>>>
>>>> I have seen the same problem reported by many users, but no suggested solutions.
>>>> Is there a way to fix this problem? I am spending a lot of time on this; please help.
>>>>
>>>> Thank you
>>>> Samir
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers


