FW: [torqueusers] p4_error: interrupt SIGx: 15

Samir Khanal skhanal at bgsu.edu
Fri Feb 20 14:54:12 MST 2009


________________________________________
From: Samir Khanal
Sent: Friday, February 20, 2009 3:13 PM
To: Gus Correa
Subject: RE: [torqueusers] p4_error: interrupt SIGx: 15

Hi Gus,
I ran into another problem.

I am able to run the examples with
mpiexec -n 8 hostname

[skhanal at comet ~]$ mpicxx -L/home/skhanal/bgtw/lib -lbgtw bgtwTorusTest.cpp -o Torus
but when I submitted a program to PBS with this script
-------------------
#PBS -l walltime=1:00:00
#PBS -N my_job
#PBS -j oe
#PBS -l nodes=6:ppn=4
echo `hostname`
echo Directory is `pwd`
echo This job is running on following Processors
echo `cat $PBS_NODEFILE`
export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
export PATH=/home/skhanal/mpich2/bin:$PATH
mpdboot -f mpd.hosts -n 7
time mpiexec -n  2 ./Torus
---------------
I got the following output

--------------
compute-0-5.local
Directory is /home/skhanal
This job is running on following Processors
compute-0-5 compute-0-5 compute-0-5 compute-0-5 compute-0-4 compute-0-4 compute-0-4 compute-0-4 compute-0-3 compute-0-3 compute-0-3 compute-0-3 compute-0-2 compute-0-2 compute-0-2 compute-0-2 compute-0-1 compute-0-1 compute-0-1 compute-0-1 compute-0-0 compute-0-0 compute-0-0 compute-0-0
Fatal error in MPI_Comm_size: Invalid communicator, error stack:
MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fff3ba06a7c) failed
MPI_Comm_size(70).: Invalid communicator
Fatal error in MPI_Comm_size: Invalid communicator, error stack:
MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fffebaa0ccc) failed
MPI_Comm_size(70).: Invalid communicator
rank 1 in job 1  compute-0-5.local_48777   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
---------------
My mpd.hosts file looks like this:

compute-0-0:4
compute-0-1:4
compute-0-2:4
compute-0-3:4
compute-0-4:4
compute-0-5:4
comet.cs.bgsu.edu:8

When I run mpiexec -n 1 ./Torus, it works.
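
The job script above boots the mpd ring from a hand-written mpd.hosts, which also lists the head node comet.cs.bgsu.edu even though it does not appear in the node list Torque printed for the job. A minimal sketch, assuming the job should use only the nodes Torque allocated, builds the host list from $PBS_NODEFILE instead. The function name nodefile_to_mpdhosts and the file name mpd.hosts.$PBS_JOBID are made up for illustration, and this is not a confirmed fix for the MPI_Comm_size error.

```shell
#!/bin/sh
# Hypothetical helper: collapse a Torque node file (one line per allocated
# core) into mpd.hosts format (one "host:ncpus" line per node).
nodefile_to_mpdhosts() {
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }'
}

# Only act inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    hosts_file="mpd.hosts.${PBS_JOBID:-$$}"   # assumed file name
    nodefile_to_mpdhosts "$PBS_NODEFILE" > "$hosts_file"
    nnodes=$(wc -l < "$hosts_file" | tr -d ' ')
    echo "Allocation has $nnodes nodes:"
    cat "$hosts_file"
    # Assumed usage with the MPICH2 mpd tools from the original script:
    #   mpdboot -f "$hosts_file" -n "$nnodes"
    #   mpiexec -n 2 ./Torus
    #   mpdallexit
fi
```

For the nodes=6:ppn=4 request shown above, the generated file would contain six lines of the form compute-0-N:4, so the ring size passed to mpdboot follows the allocation rather than a hard-coded 7.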


Thanks
Samir
________________________________________
From: Gus Correa [gus at ldeo.columbia.edu]
Sent: Friday, February 20, 2009 1:12 PM
To: Samir Khanal
Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15

Hi Samir

You are welcome.

Not really my business, but you may want to
experiment with OpenMPI as well.
It is very easy to build (great README file, excellent FAQ).
It can be built with support
for Torque or SGE, for Ethernet (TCP/IP), Myrinet (GM or MX),
and Infiniband in the same build,
and is a lot more flexible than MPICH2 in
many ways.
The runtime environment,
for instance, is much richer than in MPICH2.

Rocks has OpenMPI as a default but it is built only with gcc
(maybe gfortran in Rocks 5.1, but I have Rocks 4.3).
However, I built OpenMPI from scratch to get Torque and
Fortran support (different builds for PGI and Intel).
It took no time to do it.

OpenMPI+Torque gives you much finer control over jobs.

The climate and ocean models that we run here compiled
and ran with OpenMPI without any problem (on GigE).
We continue to use MPICH2 also, and I think it is very good
to have more than one free MPI implementation available.

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Samir Khanal wrote:
> Hi Gus
> Yes, it did solve the problem.
> I built MPICH2 with the nemesis option.
> Thank you
> Samir
> ________________________________________
> From: Gus Correa [gus at ldeo.columbia.edu]
> Sent: Friday, February 20, 2009 12:24 PM
> To: Torque Users
> Cc: Samir Khanal
> Subject: Re: [torqueusers] p4_error: interrupt SIGx: 15
>
> Hi Samir, list
>
> This is not likely to be a Torque problem;
> it is more likely caused by the old MPICH-1 and how it works (or doesn't work)
> with newer kernels.
>
> The easy fix is to upgrade to MPICH2
> with the Nemesis communication channel,
> or to OpenMPI.
> There was a long discussion about p4 errors from
> MPICH 1.2.7p1 in Rocks 5.1 not so long ago.
> Follow this thread backwards to see the details:
>
> http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2
>
> I hope this helps,
> Gus Correa
>
> Samir Khanal wrote:
>> Hi All
>> I get this error
>> p4_error: interrupt SIGx: 15
>> every time I use the following PBS job submission script.
>> I am using Torque/Maui with Rocks 5.1 and MPICH 1.2.7p1.
>> #PBS -l walltime=1:00:00
>> #PBS -N my_job
>> #PBS -j oe
>> #PBS -l nodes=2
>> export LD_LIBRARY_PATH=/home/skhanal/bgtw/lib:$LD_LIBRARY_PATH
>> time /home/skhanal/mpiexec/bin/mpiexec -verbose -mpich-p4-no-shmem ./Torus
>>
>> -------------------------------
>> compute-0-5.local
>> compute-0-5 compute-0-5
>> mpiexec: resolve_exe: using absolute path "./Torus".
>> node  0: name compute-0-5, cpu avail 2
>> mpiexec: process_start_event: evt 2 task 0 on compute-0-5.
>> mpiexec: read_p4_master_port: waiting for port from master.
>> mpiexec: read_p4_master_port: got port 38143.
>> mpiexec: process_start_event: evt 4 task 1 on compute-0-5.
>> mpiexec: All 2 tasks (spawn 0) started.
>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>> mpiexec: killall: caught signal 15 (Terminated).
>> mpiexec: kill_tasks: killing all tasks.
>> mpiexec: wait_tasks: waiting for compute-0-5 compute-0-5.
>> p0_8747:  p4_error: interrupt SIGx: 15
>> p1_8749:  p4_error: interrupt SIGx: 13
>> p1_8749: (210.218750) net_send: could not write to fd=5, errno = 32
>> bm_list_8748:  p4_error: interrupt SIGx: 15
>> mpiexec: killall: caught signal 15 (Terminated).
>> -----------------------------
>>
>> I have seen many users report the same problem, but no suggested solutions.
>> Is there a way to fix this problem? I am spending a lot of time on this, please help.
>>
>> Thank you
>> Samir
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers


