[torqueusers] mpiexec jobs got stuck

Abhishek Gupta abhig at Princeton.EDU
Wed May 13 15:57:01 MDT 2009


Of the two possible cases you mentioned in your last mail, the first one is not 
true. I have checked, and there is no problem with the nodes on which the 
jobs are running. I am not sure about the second case you discussed; 
it may be switch overloading.
Abhi.

Axel Kohlmeyer wrote:
> On Wed, 2009-05-13 at 17:07 -0400, Abhishek Gupta wrote:
>   
>> Hi,
>> I even tried the newer version of mpich and mpiexec but still no luck.
>> When I kill the stuck job, I get the following error message:
>>     
>
> abhi,
>
>   
>> ------------------------------------------------
>> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
>> need to use NFS version 3, ensure that the lockd daemon is running on
>> all the machines, and mount the directory with the 'noac' option (no
>> attribute caching).
>>     
>
> never seen that, but did you check on the issues the error messages
> are referring to? writing in parallel to NFS is yet another bag of
> fleas. it is quite possible that you are overloading the locking
> support of your NFS server.
>
> you may have to check carefully the parallel i/o options
> of your MPI installation, or try a different MPI package.
>
> this definitely has nothing to do with torque, though.
>
> cheers,
>    axel.
>
>   
>> [cli_4]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
>> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
>> need to use NFS version 3, ensure that the lockd daemon is running on
>> all the machines, and mount the directory with the 'noac' option (no
>> attribute caching).
>> [cli_5]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
>> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
>> need to use NFS version 3, ensure that the lockd daemon is running on
>> all the machines, and mount the directory with the 'noac' option (no
>> attribute caching).
>> [cli_27]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 27
>> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
>> need to use NFS version 3, ensure that the lockd daemon is running on
>> all the machines, and mount the directory with the 'noac' option (no
>> attribute caching).
>> [cli_26]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 26
>> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
>> need to use NFS version 3, ensure that the lockd daemon is running on
>> all the machines, and mount the directory with the 'noac' option (no
>> attribute caching).
>> [cli_28]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 28
>> HDF5: infinite loop closing library
>> HDF5: infinite loop closing library
>>
>> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>>
>> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>> HDF5: infinite loop closing library
>> HDF5: infinite loop closing library
>>
>> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>>
>> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>> HDF5: infinite loop closing library
>>
>> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>> mpiexec: Warning: tasks 4-5,26-28 exited with status 1.
>> -------------------------------------------------------
>>
>> Any ideas?
>> Thanks,
>> Abhi.
>>
>
>   
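[Editor's note] For readers hitting the same ADIOI_Set_lock failure, the fix the error message itself asks for can be sketched as below. This is only a sketch: the server name (nfsserver), export path, and mount point are placeholders, not taken from the thread.

```shell
# Sketch only: server name, export path, and mount point are placeholders.
# The ADIOI error asks for NFS version 3 with attribute caching disabled.
#
# /etc/fstab entry on each compute node:
#   nfsserver:/export/home  /home  nfs  vers=3,noac  0  0
#
# or remount an already-mounted share in place:
mount -o remount,vers=3,noac nfsserver:/export/home /home

# Confirm the NFS lock manager (lockd) is registered on the server;
# rpcinfo lists it as "nlockmgr":
rpcinfo -p nfsserver | grep nlockmgr
```

Note that `noac` disables client-side attribute caching and can noticeably slow metadata-heavy workloads, so it is usually applied only to the filesystem used for parallel I/O.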

