[torqueusers] mpiexec jobs got stuck

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed May 13 15:13:37 MDT 2009


On Wed, 2009-05-13 at 17:07 -0400, Abhishek Gupta wrote:
> Hi,
> I even tried the newer version of mpich and mpiexec but still no luck.
> When I kill the stuck job, I get the following error message:

abhi,

> ------------------------------------------------
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
> need to use NFS version 3, ensure that the lockd daemon is running on
> all the machines, and mount the directory with the 'noac' option (no
> attribute caching).

never seen that, but did you check on the issues the error messages
are referring to? writing in parallel to NFS is yet another bag of
fleas. it is quite possible, that you are overloading the locking
support of your NFS server.

you may have to check carefully about the parallel i/o options
of you MPI installation, or try a different MPI package.

this definitely has nothing to do with torque, though.

cheers,
   axel.

> [cli_4]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
> need to use NFS version 3, ensure that the lockd daemon is running on
> all the machines, and mount the directory with the 'noac' option (no
> attribute caching).
> [cli_5]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
> need to use NFS version 3, ensure that the lockd daemon is running on
> all the machines, and mount the directory with the 'noac' option (no
> attribute caching).
> [cli_27]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 27
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
> need to use NFS version 3, ensure that the lockd daemon is running on
> all the machines, and mount the directory with the 'noac' option (no
> attribute caching).
> [cli_26]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 26
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you
> need to use NFS version 3, ensure that the lockd daemon is running on
> all the machines, and mount the directory with the 'noac' option (no
> attribute caching).
> [cli_28]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 28
> HDF5: infinite loop closing library
> HDF5: infinite loop closing library
> 
> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> 
> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> HDF5: infinite loop closing library
> HDF5: infinite loop closing library
> 
> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> 
> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> HDF5: infinite loop closing library
> 
> D,G,S,T,D,S,F,D,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> mpiexec: Warning: tasks 4-5,26-28 exited with status 1.
> -------------------------------------------------------
> 
> Any ideas?
> Thanks,
> Abhi.
> 
> >   

-- 
=======================================================================
Axel Kohlmeyer   akohlmey at cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.



More information about the torqueusers mailing list