[torqueusers] pbs_mom unable to chdir to automounted dirs

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Thu Oct 23 08:13:00 MDT 2008


The reason I do not want to hard mount is that if I reboot my nfs 
server, all of the nodes will hang, since they will all be hard-mounted to 
that server. And I believe it is less load on the nfs server if the 
automounter serves only the dirs actually requested instead of all nfs dirs.
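
One thing I may try is raising the automounter's idle timeout so the home
dirs stay mounted across job starts without hard mounting anything.  A sketch
of what that would look like in my auto.master (the 600-second value is just
an illustrative guess, not a recommendation):

```
#+auto.master
# Per-map option: keep /fs entries mounted for 600s of idle time
/fs     /etc/auto.home  --timeout=600
```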

Greenseid, Joseph M. wrote:
> Why don't you want to hard mount NFS directories on the compute nodes?  What problem is this going to cause you?
>  
> --Joe
>
> ________________________________
>
> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
> Sent: Wed 10/22/2008 3:20 PM
> To: Greenseid, Joseph M.
> Cc: Luke Scharf; torqueusers at supercluster.org
> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>
>
>
> Good thought... How would I slow down pbs_mom?  I tried putting a sleep
> command in my pbs script and, as Luke suggested, an "ls $HOME"; no dice.  Do
> I need to edit the pbs_mom daemon?
>
> I guess another hack would be to mount the nfs dirs on the compute
> nodes by hand (via "cd nfsdir"), but after the automounter timed out, or
> after a reboot, I would be in the same situation again.  Or to hard mount
> the nfs dirs (I do not want to do this!!)
>
> Appreciate your help.
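
(Following up on my own question above: rather than editing pbs_mom itself,
one idea is a prologue script that stats the user's home dir before the job
starts, giving the automounter time to bring the mount up.  This is only a
sketch, assuming the usual TORQUE convention that the prologue lives in
mom_priv/prologue and is called with the job id as $1 and the user name as
$2; the 5-try/1-second retry numbers are arbitrary.)

```shell
#!/bin/sh
# Sketch of a TORQUE prologue ($PBS_HOME/mom_priv/prologue).
# pbs_mom runs the prologue before spawning the job, so touching the
# user's home dir here gives autofs a chance to mount it in time.

wait_for_home() {
    # Look up the user's home dir and poll until it appears.
    user="$1"
    home=$(getent passwd "$user" | cut -d: -f6)
    for attempt in 1 2 3 4 5; do
        [ -n "$home" ] && [ -d "$home" ] && return 0
        sleep 1
    done
    echo "prologue: home '$home' for user '$user' never mounted" >&2
    return 1
}

# pbs_mom passes the user name as $2; default to root only so the
# sketch can be run standalone for testing.
user="${2:-root}"
wait_for_home "$user" && echo "home is up"
```

In the real prologue the last line would be `wait_for_home "$2" || exit 1`,
since a non-zero prologue exit makes pbs_mom abort the job instead of
failing on the chdir.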
>
> Greenseid, Joseph M. wrote:
>   
>> I don't have a real useful suggestion, but just a thought -- could it simply be a timing issue in that pbs_mom is trying to stat a file or the directory before it's been fully mounted?  It may take a second to get the directory mounted if it wasn't already, and maybe PBS is too fast for the auto-mounter, esp if the NFS server is under some sort of load and could be taking a little longer to respond than normal?
>>
>> --Joe
>>
>> ________________________________
>>
>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
>> Sent: Wed 10/22/2008 2:53 PM
>> To: Luke Scharf
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>
>>
>>
>> The node OS is CentOS5 as is the nfs server.  The pbs server is running
>> CentOS4.5.  I have rebooted and chanted... :-) :-(
>>
>> Here is my simple pbs script; it does not use absolute paths.  The
>> script will run only after the nfs dirs are somehow already mounted on the
>> node.  I have also tried it with absolute path names, and it makes no difference.
>>
>> pbs script:
>> #PBS -l nodes=node1048
>> # join stderr and stdout and write them to a file
>> #PBS -j oe
>> #PBS -o test3.o
>>       
>> # cd into the working directory
>> cd $PBS_O_WORKDIR
>> # print out some diagnostic stuff
>> echo Running on host `hostname`
>> echo Directory is `pwd`
>> echo Start time is `date`
>> # run my commands
>>
>> ./dostuff2.pl data.txt > test3.out1
>>
>> # print out some diagnostic stuff
>> echo Stop time is `date`
>>
>>
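One more data point: a retry loop around the cd in the script itself does not
address the reported failure, since pbs_mom's own chdir to $HOME happens
before the script ever runs; it only protects the $PBS_O_WORKDIR cd.  For
completeness, a sketch (the 10-try/1-second numbers are arbitrary, and the
/tmp fallback is only so it runs outside of PBS):

```shell
#!/bin/sh
# Retry the chdir until the automounter has the directory up.
# Assumes $PBS_O_WORKDIR is set by TORQUE as usual; fall back to /tmp
# so the sketch can run standalone.
: "${PBS_O_WORKDIR:=/tmp}"

tries=0
until cd "$PBS_O_WORKDIR" 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge 10 ]; then
        echo "giving up: $PBS_O_WORKDIR never mounted" >&2
        exit 1
    fi
    sleep 1
done
echo "Directory is $(pwd)"
```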
>>
>> Luke Scharf wrote:
>>> If it works with the shell, however, the problem almost has to be with
>>> something other than the automounter.
>>>
>>> Are any absolute paths in the qsub script correct?
>>>
>>> -Luke
>>>
>>>
>>> Luke Scharf wrote:
>>>> That looks happy, too.
>>>>
>>>> What is the underlying OS running on the node?
>>>>
>>>> Have you tried just rebooting everything while muttering laments
>>>> about stray alpha-particles to everyone within earshot?
>>>>
>>>> -Luke
>>>>
>>>> Mary Ellen Fitzpatrick wrote:
>>>>> Yeah, that is why I am stumped... because I can cd to the nfs dirs,
>>>>> it seems like autofs is working correctly.  But unless the nfs dir is
>>>>> pre-mounted, pbs_mom cannot find it.  Very strange...
>>>>>
>>>>> Yes, getent passwd gives the correct home dir info:
>>>>> [root at node1048 mom_priv]# getent passwd
>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>
>>>>> Luke Scharf wrote:
>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>
>>>>>>
>>>>>> Does the "getent passwd" information for the user have a correct
>>>>>> home directory on the node?
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>>
>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>> Thanks Luke.
>>>>>>> Right now my cluster is one node, with an additional 50 to be
>>>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>>>> am running is very simple; no nfs load on the server.
>>>>>>>
>>>>>>> My $usecp is, I believe, correct, and it works properly once the nfs
>>>>>>> dir is mounted:
>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>
>>>>>>> My auto.home file:
>>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>>
>>>>>>> auto.master file:
>>>>>>> #+auto.master
>>>>>>> /fs     /etc/auto.home
>>>>>>>
>>>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>>>> in a config file.  Not sure which one it is at this point.
>>>>>>>
>>>>>>>
>>>>>>> Luke Scharf wrote:
>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>>>> can log into the nodes and cd into the nfs mounted dirs, no problem.
>>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>>
>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>> failed: No such file or directory
>>>>>>>>>
>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
>>>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>>>> to do that.
>>>>>>>>>
>>>>>>>>> Anyone run into this?  Trying to figure out if it is a torque
>>>>>>>>> issue or an automount issue.
>>>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>>>> job-start is likely to create a mount storm and generate a lot of
>>>>>>>> I/O.  Some servers can handle it, some can't.
>>>>>>>>
>>>>>>>> Yay for scaling issues!
>>>>>>>>
>>>>>>>> -Luke
>>>>>>>>
>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>> --
>> Thanks
>> Mary Ellen
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>>
>
> --
> Thanks
> Mary Ellen
>
>
>
>
>   

-- 
Thanks
Mary Ellen


