[torqueusers] pbs_mom unable to chdir to automounted dirs

Garrick garrick at usc.edu
Thu Oct 23 09:18:06 MDT 2008


You *want* the nodes to hang while the NFS server is rebooting.  The
alternative is to have all apps exit.

HPCC/Linux Systems Admin
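
The trade-off above corresponds to the NFS `hard` and `soft` mount options. As a sketch, here is the auto.home entry quoted later in this thread next to a hypothetical soft-mounted variant. The `soft`, `timeo`, and `retrans` values are illustrative assumptions, not settings anyone in the thread used:

```text
# auto.home (sketch)
# Hard mount, as in the map quoted in this thread: I/O blocks and retries
# until the server returns, so jobs stall during a reboot and then resume.
userB1  -rw,hard,intr   userB:/userB/u1

# Hypothetical soft mount (assumption): after retrans retries of timeo
# tenths of a second each, I/O fails with an error, and applications may
# exit or lose data.
#userB1  -rw,soft,timeo=30,retrans=3   userB:/userB/u1
```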

On Oct 23, 2008, at 7:13 AM, Mary Ellen Fitzpatrick <mfitzpat at bu.edu>  
wrote:

> The reason I do not want to hard mount is that if I reboot my nfs
> server, the nodes will hang, as all of them will be hard mounted
> to that server. And I believe it puts less load on the nfs server if
> automount serves only the dirs requested instead of all nfs dirs.
>
> Greenseid, Joseph M. wrote:
>> Why don't you want to hard mount NFS directories on the compute  
>> nodes?  What problem is this going to cause you?
>> --Joe
>>
>> ________________________________
>>
>> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
>> Sent: Wed 10/22/2008 3:20 PM
>> To: Greenseid, Joseph M.
>> Cc: Luke Scharf; torqueusers at supercluster.org
>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted  
>> dirs
>>
>>
>>
>> Good thought... How would I slow down pbs_mom?  I tried putting a sleep
>> command in my pbs script and, as Luke suggested, "ls $HOME"; no dice.
>> Do I need to edit the pbs_mom daemon?
>>
>> I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on
>> the compute nodes, but then after the automounter timed out or a reboot,
>> I would be in the same situation.  Or to hard mount the nfs dirs (do not
>> want to do this!!)
>>
>> Appreciate your help.
>>
>> Greenseid, Joseph M. wrote:
>>
>>> I don't have a real useful suggestion, but just a thought -- could  
>>> it simply be a timing issue in that pbs_mom is trying to stat a  
>>> file or the directory before it's been fully mounted?  It may take  
>>> a second to get the directory mounted if it wasn't already, and  
>>> maybe PBS is too fast for the auto-mounter, esp if the NFS server  
>>> is under some sort of load and could be taking a little longer to  
>>> respond than normal?
>>>
>>> --Joe
>>>
>>> ________________________________
>>>
>>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen  
>>> Fitzpatrick
>>> Sent: Wed 10/22/2008 2:53 PM
>>> To: Luke Scharf
>>> Cc: torqueusers at supercluster.org
>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted  
>>> dirs
>>>
>>>
>>>
>>> The node OS is CentOS5, as is the nfs server.  The pbs server is running
>>> CentOS4.5.  I have rebooted and chanted... :-) :-(
>>>
>>> Here is my simple pbs script; it does not have absolute paths.  The
>>> script will run only after the nfs dirs are somehow mounted on the
>>> node.  I have tried it with absolute path names, and it makes no
>>> difference.
>>>
>>> pbs script:
>>> #PBS -l nodes=node1048
>>> # join stderr and stdout and write them to a file
>>> #PBS -j oe
>>> #PBS -o test3.o
>>> # cd into the working directory
>>> cd $PBS_O_WORKDIR
>>> # print out some diagnostic stuff
>>> echo Running on host `hostname`
>>> echo Directory is `pwd`
>>> echo Start time is `date`
>>> # run my commands
>>>
>>> ./dostuff2.pl data.txt > test3.out1
>>>
>>> # print out some diagnostic stuff
>>> echo Stop time is `date`
>>>
>>>
>>>
>>> Luke Scharf wrote:
>>>
>>>> If it works with the shell, however, the problem almost has to be  
>>>> with
>>>> something other than the automounter.
>>>>
>>>> Are any absolute paths in the qsub script correct?
>>>>
>>>> -Luke
>>>>
>>>>
>>>> Luke Scharf wrote:
>>>>
>>>>> That looks happy, too.
>>>>>
>>>>> What is the underlying OS running on the node?
>>>>>
>>>>> Have you tried just rebooting everything while muttering laments
>>>>> about stray alpha-particles to everyone within earshot?
>>>>>
>>>>> -Luke
>>>>>
>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>
>>>>>> Yeah, that is why I am stumped... because I can cd to nfs dirs, it
>>>>>> seems like autofs is working correctly.  But unless the nfs dir is
>>>>>> pre-mounted, pbs_mom cannot find it.  Very strange...
>>>>>>
>>>>>> Yes, getent passwd gives the correct home dir info:
>>>>>> [root at node1048 mom_priv]# getent passwd
>>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>>
>>>>>> Luke Scharf wrote:
>>>>>>
>>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>>
>>>>>>>
>>>>>>> Does the "getent passwd" information for the user have a correct
>>>>>>> home directory on the node?
>>>>>>>
>>>>>>> -Luke
>>>>>>>
>>>>>>>
>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>
>>>>>>>> Thanks Luke.
>>>>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>>>>> am running is very simple, no nfs load on server.
>>>>>>>>
>>>>>>>> My $usecp, I believe, is correct and works properly after the nfs
>>>>>>>> dir is mounted.
>>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>>
>>>>>>>> My auto.home file:
>>>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>>>
>>>>>>>> auto.master file:
>>>>>>>> #+auto.master
>>>>>>>> /fs     /etc/auto.home
>>>>>>>>
>>>>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>>>>> in a config file.  Not sure which one it is at this point.
>>>>>>>>
>>>>>>>>
>>>>>>>> Luke Scharf wrote:
>>>>>>>>
>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>
>>>>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
>>>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>>>
>>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory  
>>>>>>>>>> (2)
>>>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>
>>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
>>>>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>>>>> to do that.
>>>>>>>>>>
>>>>>>>>>> Has anyone run into this?  Trying to figure out if it is a torque
>>>>>>>>>> issue or automount issue.
>>>>>>>>>>
>>>>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>>>>> job-start is likely to create a mount storm and generate a lot of
>>>>>>>>> I/O.  Some servers can handle it, some can't.
>>>>>>>>>
>>>>>>>>> Yay for scaling issues!
>>>>>>>>>
>>>>>>>>> -Luke
>>>>>>>>>
>>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>>>
>>> --
>>> Thanks
>>> Mary Ellen
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>
>>>
>>
>> --
>> Thanks
>> Mary Ellen
>>
>>
>>
>>
>>
>
> -- 
> Thanks
> Mary Ellen

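The timing theory raised in the thread (pbs_mom reaching the directory before the automounter has finished mounting it) can be checked by hand with a retry loop. Below is a minimal sketch, assuming a POSIX shell; the function name `wait_for_dir` and its retry defaults are hypothetical. The thread reports that touching `$HOME` from inside the job script did not help, presumably because pbs_mom's chdir fails before the script runs, so a check like this would have to run earlier, for example by hand on the node or from a TORQUE prologue (an untested assumption).

```shell
#!/bin/sh
# wait_for_dir DIR [TRIES] [DELAY]
# Try to enter DIR (which triggers the automounter for autofs paths),
# retrying up to TRIES times with DELAY seconds between attempts.
# Returns 0 once DIR is reachable, 1 if it never becomes reachable.
wait_for_dir() {
    dir=$1
    tries=${2:-5}
    delay=${3:-1}
    i=1
    while :; do
        # Run cd in a subshell so the caller's working directory is unchanged.
        if (cd "$dir") 2>/dev/null; then
            return 0
        fi
        # Give up once the allowed number of attempts has been used.
        [ "$i" -ge "$tries" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}
```

For example, `wait_for_dir /fs/userB1/mfitzpat 10 1 || echo "home dir still not mounted" >&2` would retry for roughly ten seconds. If the first attempt fails and a later one succeeds, that points at mount latency rather than at TORQUE itself.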

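The thread notes that the failure appears only after the automounted directory has timed out. A possible stopgap, not a fix, is to lengthen the autofs idle timeout in the master map. The sketch below reuses the auto.master entry quoted in the thread; the `--timeout` value of 3600 seconds is an arbitrary assumption, and the first job after a reboot or an expiry would still hit the same error.

```text
# /etc/auto.master (sketch; the timeout value is an assumption)
/fs     /etc/auto.home  --timeout=3600
```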