[torqueusers] mpiexec jobs got stuck

Abhishek Gupta abhig at Princeton.EDU
Wed May 13 15:07:39 MDT 2009


Hi,
I also tried newer versions of MPICH and mpiexec, but still no luck.
When I kill the stuck job, I get the following error message:
------------------------------------------------
File locking failed in ADIOI_Set_lock. If the file system is NFS, you 
need to use NFS version 3, ensure that the lockd daemon is running on 
all the machines, and mount the directory with the 'noac' option (no 
attribute caching).
[cli_4]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
[the same ADIOI_Set_lock / MPI_Abort message is repeated for processes 5, 26, 27 and 28]
HDF5: infinite loop closing library
      D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,... (long run of FD entries)
[the HDF5 "infinite loop closing library" message and shutdown trace repeat for each aborted process]
mpiexec: Warning: tasks 4-5,26-28 exited with status 1.
-------------------------------------------------------
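The remedy spelled out in the ADIOI_Set_lock message comes down to how the NFS share the job writes to is mounted on the compute nodes. A sketch of what that might look like, with the server name, export path and mount point as placeholders:

    # /etc/fstab entry on every compute node (sketch only):
    # nfsvers=3 forces NFS version 3; noac disables attribute caching, as the message asks.
    # Do not add 'nolock' here, since ADIOI_Set_lock needs working file locks.
    nfsserver:/export/home  /home  nfs  nfsvers=3,noac,hard,intr  0  0

    # Confirm the NFS lock manager is registered on each node:
    rpcinfo -p | grep nlockmgr

If remounting with noac is not practical, a commonly suggested alternative is to point the HDF5/MPI-IO output at a local or parallel file system instead of NFS.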

Any ideas?
Thanks,
Abhi.

Axel Kohlmeyer wrote:
> On Wed, 2009-05-13 at 12:22 -0400, Abhishek Gupta wrote:
>
> abhi,
>
>   
>> Hi Troy,
>> I was able to fix the error message from my last mail, but the
>> problem I described at the beginning still exists, i.e. the job runs
>> for a while and then gets stuck forever. As I said, it runs fine up to
>> a node count of 20, but beyond that it shows this behavior.
>> Is there anything else I can try?
>>     
>
> If the job runs for a bit and then stops, the problem is most likely
> to be found in the MPI library or the communication hardware. Once a 
> job is started, Torque has very little to do with what happens until
> the job is finished. If this happens only with a larger number of nodes,
> there are two likely reasons: a) a specific node has a problem and
> does not get allocated for smaller jobs (assuming that nobody else
> is running on the machine), or b) there is an overload problem due to
> excessive communication. In particular, some gigE switches crap out
> under too high a load in uncontrolled ways, and many MPI implementations 
> have no provisions for that kind of behavior (corrupted data).
>
> HTH,
>    axel.
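One way to narrow down case a) from Axel's list is to pin a test job to an explicit set of hosts and bisect. A sketch, assuming the OSC mpiexec already in use here (which understands -pernode), with placeholder node names and core counts:

    # Request specific hosts instead of an anonymous node count (Torque syntax):
    #PBS -l nodes=node021:ppn=8+node022:ppn=8+node023:ppn=8

    # Quick health check inside the job: one process per allocated node,
    # each just printing its hostname, to confirm every node responds.
    mpiexec -pernode hostname

Swapping which hosts appear in the list between runs should quickly expose a single misbehaving node; if every subset hangs once the job gets large enough, Axel's case b) (switch overload) becomes the more likely culprit.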
>
>
>   
>> Thanks,
>> Abhi.
>>
>>
>> Troy Baer wrote: 
>>     
>>> On Tue, 2009-05-12 at 17:03 -0400, Abhishek Gupta wrote:
>>>   
>>>       
>>>> It is giving me an error:
>>>> mpiexec: Error: get_hosts: pbs_statjob returned neither "ncpus" nor "nodect"
>>>>
>>>> Any suggestion?
>>>>     
>>>>         
>>> What does your job script look like?  How are you requesting nodes
>>> and/or processors?
>>>
>>> 	--Troy
>>>   
>>>       
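For completeness, the "pbs_statjob returned neither ncpus nor nodect" error mentioned earlier typically shows up when the job never requested nodes in a form mpiexec can read back from the server. A minimal job script sketch, with placeholder names, node counts and walltime:

    #!/bin/bash
    #PBS -N hdf5_run
    #PBS -l nodes=24:ppn=8
    #PBS -l walltime=01:00:00
    #PBS -j oe

    cd $PBS_O_WORKDIR
    # OSC mpiexec picks up the node allocation from Torque itself,
    # so no -np or machinefile argument is needed here.
    mpiexec ./my_parallel_app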