[torqueusers] pbs_mom unable to chdir to automounted dirs

James J Coyle jjc at iastate.edu
Wed Oct 22 14:04:47 MDT 2008


Mary Ellen,

  This is just a hack, and perhaps you've already thought of it, 
but couldn't you just issue (cd nfsdir) in the prologue and,  
if needed, in the epilogue file in mom_priv?

   These should execute just before the job script starts and just 
after it exits, and before the copy back, I think, so that should 
be exactly when you need it. 
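
   A rough sketch of what such a prologue in mom_priv might look like is
below. It is untested against a real pbs_mom. It assumes TORQUE passes the
job owner's user name as the second argument to the script; the function
name trigger_automount and the retry count of 5 are just made up for
illustration. Keep in mind the script runs as root, so with root-squashed
exports the cd itself may be refused even though the attempt still asks
autofs for the mount.

```shell
#!/bin/sh
# Sketch of a mom_priv/prologue that touches the job owner's home directory
# so that autofs has a chance to mount it before the job needs it.
# Assumption: the job owner's user name arrives as the second argument.

# trigger_automount DIR [TRIES]
# Try to cd into DIR, retrying once per second, up to TRIES attempts.
# Returns 0 as soon as a cd succeeds, 1 if every attempt fails.
trigger_automount() {
    dir=$1
    tries=${2:-5}
    i=0
    while :; do
        # The subshell keeps the cd from changing our own working directory.
        if (cd "$dir") 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        [ "$i" -ge "$tries" ] && return 1
        sleep 1
    done
}

job_user=${2:-}
if [ -n "$job_user" ]; then
    # Field 6 of the passwd entry is the home directory.
    home_dir=$(getent passwd "$job_user" | cut -d: -f6)
    if [ -n "$home_dir" ]; then
        # Only warn on failure, so the script's own status stays 0.
        trigger_automount "$home_dir" ||
            echo "prologue: could not reach $home_dir" >&2
    fi
fi
```

   The script deliberately finishes with a zero status even when the
directory cannot be reached, since a failing prologue normally keeps the
job from starting at all.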

-- 
 James Coyle, PhD
 SGI Origin, Alpha, Xeon and Opteron Cluster Manager
 High Performance Computing Group     
 235 Durham Center            
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://jjc.public.iastate.edu


> Good thought... How would I slow down pbs_mom, I tried putting a sleep 
> command in my pbs script and as Luke suggested "ls $HOME", no dice.   Do 
> I need to edit the pbs_mom daemon?
> 
> I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on 
> the compute nodes, but then after the automounter timed out or reboot, I 
> would be in the same situation.  Or to hard mount the nfs dirs (do not 
> want to do this!!)
> 
> Appreciate your help.
> 
> Greenseid, Joseph M. wrote:
> > I don't have a real useful suggestion, but just a thought -- could it simply be a timing issue in that pbs_mom is trying to stat a file or the directory before it's been fully mounted?  It may take a second to get the directory mounted if it wasn't already, and maybe PBS is too fast for the auto-mounter, esp if the NFS server is under some sort of load and could be taking a little longer to respond than normal?
> >  
> > --Joe
> >
> > ________________________________
> >
> > From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
> > Sent: Wed 10/22/2008 2:53 PM
> > To: Luke Scharf
> > Cc: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
> >
> >
> >
> > The node OS is CentOS5 as is the nfs server.  The pbs server is running
> > CentOS4.5.  I have rebooted and chanted... :-) :-(
> >
> > Here is my simple pbs script and it does not have absolute paths.  The
> > script will run only after the nfs dirs are somehow mounted on the
> > node.  I have tried it with absolute path names, and it makes no difference.
> >
> > pbs script:
> > #PBS -l nodes=node1048
> > # join stderr and stdout and write them to a file
> > #PBS -j oe
> > #PBS -o test3.o
> >        
> > # cd into the working directory
> > cd $PBS_O_WORKDIR
> > # print out some diagnostic stuff
> > echo Running on host `hostname`
> > echo Directory is `pwd`
> > echo Start time is `date`
> > # run my commands
> >
> > ./dostuff2.pl data.txt > test3.out1
> >
> > # print out some diagnostic stuff
> > echo Stop time is `date`
> >
> >
> >
> > Luke Scharf wrote:
> >   
> >> If it works with the shell, however, the problem almost has to be with
> >> something other than the automounter.
> >>
> >> Are any absolute paths in the qsub script correct?
> >>
> >> -Luke
> >>
> >>
> >> Luke Scharf wrote:
> >>     
> >>> That looks happy, too.
> >>>
> >>> What is the underlying OS running on the node?
> >>>
> >>> Have you tried just rebooting everything while muttering laments
> >>> about stray alpha-particles to everyone within earshot?
> >>>
> >>> -Luke
> >>>
> >>> Mary Ellen Fitzpatrick wrote:
> >>>       
> >>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
> >>>> seems like autofs is working correctly.  But unless the nfs dir is
> >>>> pre-mounted, pbs_mom can not find it.  Very strange...
> >>>>
> >>>> Yes, getent passwd gives the correct home dir info
> >>>> [root at node1048 mom_priv]# getent passwd
> >>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
> >>>>
> >>>> Luke Scharf wrote:
> >>>>         
> >>>>> Nothing that you mention looks amiss at first glance...
> >>>>>
> >>>>>
> >>>>> Does the "getent passwd" information for the user have a correct
> >>>>> home directory on the node?
> >>>>>
> >>>>> -Luke
> >>>>>
> >>>>>
> >>>>> Mary Ellen Fitzpatrick wrote:
> >>>>>           
> >>>>>> Thanks Luke.
> >>>>>> Right now, my cluster is one node, with additional 50 to be
> >>>>>> brought on-line once I resolve the automount problem.  The job I
> >>>>>> am running is very simple, no nfs load on server.
> >>>>>>
> >>>>>> my $usecp I believe is correct and works properly after the nfs
> >>>>>> dir is mounted.
> >>>>>> $usecp *:/fs/userB1 /fs/userB1
> >>>>>>
> >>>>>> My auto.home file:
> >>>>>> userB1  -rw,hard,intr   userB:/userB/u1
> >>>>>>
> >>>>>> auto.master file:
> >>>>>> #+auto.master
> >>>>>> /fs     /etc/auto.home
> >>>>>>
> >>>>>> I believe it is an automount issue and I need to tweak a parameter
> >>>>>> in a config file.  Not sure which one it is at this point.
> >>>>>>
> >>>>>>
> >>>>>> Luke Scharf wrote:
> >>>>>>             
> >>>>>>> Mary Ellen Fitzpatrick wrote:
> >>>>>>>               
> >>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
> >>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
> >>>>>>>> When I submit a job to a node and the automounted nfs dirs are
> >>>>>>>> not mounted (timed out), I get the following error:
> >>>>>>>>
> >>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
> >>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
> >>>>>>>> failed: No such file or directory
> >>>>>>>>
> >>>>>>>> If I immediately resubmit the job to the same node, it will
> >>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
> >>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
> >>>>>>>> home dirs, I have no problem running the jobs, but I do not want
> >>>>>>>> to do that.
> >>>>>>>>
> >>>>>>>> Any one run into this?  Trying to figure out if it is a torque
> >>>>>>>> issue or automount issue.
> >>>>>>>>                 
> >>>>>>> How big is your cluster?  How capable is the NFS server?  A
> >>>>>>> job-start is likely to create a mountstorm, and generate a lot of
> >>>>>>> I/O.  Some servers can handle it, some can't.
> >>>>>>>
> >>>>>>> Yay for scaling issues!
> >>>>>>>
> >>>>>>> -Luke
> >>>>>>>
> >>>>>>> P.S. I second the suggestion of checking the $usecp value.
> >>>>>>>               
> >
> > --
> > Thanks
> > Mary Ellen
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >   
> 
> -- 
> Thanks
> Mary Ellen
> 



