[torqueusers] p4_error: latest msg from perror: Bad file descriptor

Vadivelan Ranjith achillesvelan at yahoo.co.in
Thu Oct 12 04:50:35 MDT 2006


Hi,
Thank you very much for your reply and for looking into my problem.
I debugged my code using idb and got the following error. I am not sure whether my debugging command is correct.
First I compiled the code with:

mpif90 -g test.f -o test.exe

Then I tried to debug it:

/opt/mpich/intel/bin/mpirun -dbg=idb -np 6 test.exe
I got the following error. Can you see what mistake I made?

[velan at galaxy debug]$ /opt/mpich/intel/bin/mpirun -dbg=idb -np 6 test.exe
out.e
Intel(R) Debugger for IA-32 -based Applications, Version 9.0-16, Build 20051121
Reading symbolic information from /home/velan/debug/test.exe...done
Evaluating '::MPIR_proctable[0]' failed!
The value (166516824) is not an array or pointer!
Fatal error: Can't find process information.
p5_21896:  p4_error: net_recv read:  probable EOF on socket: 1
p4_17944:  p4_error: interrupt SIGx: 13
[velan at galaxy debug]$ p2_21089:  p4_error: interrupt SIGx: 13
rm_l_5_21913: (0.055103) net_send: could not write to fd=5, errno = 32
p5_21896: (0.058996) net_send: could not write to fd=5, errno = 32
p4_17944: (10.280392) net_send: could not write to fd=5, errno = 32

p2_21089: (11.816449) net_send: could not write to fd=5, errno = 32
I also tried to run in parallel mode and got the following error:

/opt/mpich/intel/bin/mpirun -parallel -dbg=idb -np 4 test.exe
Unrecognized argument -parallel ignored.
Intel(R) Debugger for IA-32 -based Applications, Version 9.0-16, Build 20051121
Reading symbolic information from /home/velan/debug/test.exe...done
Evaluating '::MPIR_proctable[0]' failed!
The value (166514304) is not an array or pointer!
Fatal error: Can't find process information.

Can you help me fix this problem?
Thanks
Velan

Rajeev Thakur <thakur at mcs.anl.gov> wrote: The error message says that one of the requests passed to MPI_Waitall is invalid. It is hard to tell further what may be the problem, but it's likely a bug in your code.

Rajeev
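
To illustrate the point: MPI_Waitall reports "Invalid MPI_Request" when an entry in the request array holds a garbage value, typically a slot that was never filled by a nonblocking call or that was reused after being freed. Below is a minimal free-form Fortran sketch (hypothetical names and a ring exchange chosen for illustration, not Velan's actual code) of the defensive pattern: preset every slot to MPI_REQUEST_NULL, which MPI_Waitall accepts.

program waitall_demo
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs
  integer :: requests(2)
  integer :: statuses(MPI_STATUS_SIZE, 2)
  integer :: sendbuf, recvbuf

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Preset every slot: MPI_Waitall accepts MPI_REQUEST_NULL, but an
  ! uninitialized integer here is exactly an "Invalid MPI_Request".
  requests = MPI_REQUEST_NULL

  sendbuf = rank
  if (nprocs > 1) then
     ! Ring exchange: post nonblocking calls and keep their handles.
     call MPI_Isend(sendbuf, 1, MPI_INTEGER, mod(rank+1, nprocs), 0, &
                    MPI_COMM_WORLD, requests(1), ierr)
     call MPI_Irecv(recvbuf, 1, MPI_INTEGER, mod(rank+nprocs-1, nprocs), &
                    0, MPI_COMM_WORLD, requests(2), ierr)
  end if

  ! Safe even for slots that were never posted, since they hold
  ! MPI_REQUEST_NULL rather than garbage.
  call MPI_Waitall(2, requests, statuses, ierr)

  call MPI_Finalize(ierr)
end program waitall_demo

This compiles the same way as above, e.g. mpif90 waitall_demo.f90 -o waitall_demo. With count=250 as in the job.e trace below, common culprits are a loop that posts fewer requests than it later waits on, or a request array reused across iterations without being reset.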

       
---------------------------------
From: Vadivelan Ranjith [mailto:achillesvelan at yahoo.co.in]
Sent: Wednesday, October 11, 2006 8:09 AM
To: Rajeev Thakur
Cc: mpi-maint at mcs.anl.gov
Subject: RE: [MPI #11007] p4_error: latest msg from perror: Bad file descriptor


   
Hi,
Thanks for the reply. I got the following error when I used mpich2-1.0.3.
This machine was installed manually (not Rocks). Every time, the name of
the compute node changes in the job.o file. For example:
..."rank 5 in job 1  node07.cluster2.iitb.ac.in_32814"
..."rank 5 in job 1  node08.cluster2.iitb.ac.in_32776"
..."rank 5 in job 1  node09.cluster2.iitb.ac.in_32817"
I don't know how to fix the problem. If anybody knows, please help me.

Velan
----------------------------------------------------------------------------------------------
job.e
----------------------------------------------------------------------------------------------
velan at galaxy:~/3DSIM$ cat job.e10340
[cli_5]: aborting job:
Fatal error in MPI_Waitall: Invalid MPI_Request, error stack:
MPI_Waitall(241): MPI_Waitall(count=250, req_array=0x9f7ede0, status_array=0x9f4e820) failed
MPI_Waitall(109): Invalid MPI_Request
----------------------------------------------------------------------------------------------
job.o
----------------------------------------------------------------------------------------------
velan at galaxy:~/3DSIM$ cat job.o10340
# Allocating   5 nodes to block  1
# Allocating   1 nodes to block  2
# Require mxb >=   97
# Require mxa >=   26 mya >=   97 and mza >=   75
# Maximum load imbalance =  71.69%
#
# Navier-Stokes Simulation
# Implicit Full Matrix DP-LUR

rank 5 in job 1  node09.cluster2.iitb.ac.in_32776  caused collective abort of all ranks
  exit status of rank 5: killed by signal 9
----------------------------------------------------------------------------------------------


Rajeev Thakur <thakur at mcs.anl.gov> wrote: It is hard to tell, but it is possible that there might be some bug in your code. You can try using MPICH2 and see if you get any better error message.

Rajeev

                   
---------------------------------
From: Vadivelan Ranjith [mailto:achillesvelan at yahoo.co.in]
Sent: Tuesday, October 10, 2006 7:56 AM
To: mpi-maint at mcs.anl.gov
Cc: mpi-maint at mcs.anl.gov
Subject: [MPI #11007] p4_error: latest msg from perror: Bad file descriptor


       

Hi
Thank you all for your help.
Today I got an error message when submitting a job. First I ran the code
using the explicit method: I got accurate results, and no problem occurred
when I submitted the job. Now I have changed my code to the implicit
method, and I get an error when I submit the job.
I checked carefully: it reads all the files and the iteration starts, but
after one iteration it gives the following error. The same code runs on
another machine and gives correct results, so please help me fix it.

Thanks in advance,
Velan

----------------------------------------------------------------
job.e file:

p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
-----------------------------------------------------------------
job.o file:
3
node18.local
node19.local
node17.local
# Allocating   5 nodes to block  1
# Allocating   1 nodes to block  2
# Require mxb >=   97
# Require mxa >=   26 mya >=   97 and mza >=   75
# Maximum load imbalance =  71.69%
# Navier-Stokes Simulation
# Implicit Full Matrix DP-LUR
# Reading restart files...( 0.34 seconds)
# Freestream Mach Number =  6.50

 1   0.3670E+01   0.7803E+05   16   15    7    2  0.1222E-08
p5_2609:  p4_error: interrupt SIGx: 13 
bm_list_17559: (3.666982) wakeup_slave: unable to interrupt slave 0 pid 17542
rm_l_1_18696: (2.738297) net_send: could not write to fd=6, errno = 9
rm_l_1_18696:  p4_error: net_send write: -1
rm_l_2_2605: (2.614927) net_send: could not write to fd=6, errno = 9
rm_l_4_18718: (2.373120) net_send: could not write to fd=6, errno = 9
rm_l_4_18718:  p4_error: net_send write: -1
rm_l_2_2605:  p4_error: net_send write: -1
rm_l_3_17584: (2.496277) net_send: could not write to fd=6, errno = 9
rm_l_3_17584:  p4_error: net_send write: -1
rm_l_5_2626: (2.249144) net_send: could not write to fd=5, errno = 32
p5_2609: (2.251356) net_send: could not write to fd=5, errno = 32
-------------------------------------------------------------------
job file:
#!/bin/bash
#PBS -l nodes=3:ppn=1

cd $PBS_O_WORKDIR
n=`/usr/local/bin/pbs.py $PBS_NODEFILE hosts`
echo $n
cat hosts
/opt/mpich/intel/bin/mpirun -nolocal -machinefile hosts -np 6 pg3d.exe
-------------------------------------------------------------------
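
For reference, a script like this is submitted with qsub, and Torque/PBS writes the two streams shown above to <script>.o<jobid> and <script>.e<jobid> in the submission directory (the job id below is illustrative):

qsub job          # prints the job id assigned by the server
cat job.o10340    # standard output of the run
cat job.e10340    # standard error, where the p4_error lines land
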
Machine configuration:
CPU: Intel(R) Dual Processor Xeon(R) CPU 3.2GHz
Installation using rocks4.1
       
       