[torqueusers] pbs_mom unable to chdir to automounted dirs ---RESOLVED

Stewart.Samuels at sanofi-aventis.com Stewart.Samuels at sanofi-aventis.com
Thu Oct 23 13:02:15 MDT 2008


Interestingly, on one of our roughly 100-node clusters, we are using
autofs with soft mounts.  We too use the -g option of automount and,
to date, for almost a year now, we have experienced no issues.

	Cheers,
	Stewart 

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Mary Ellen
Fitzpatrick
Sent: Thursday, October 23, 2008 1:32 PM
To: Garrick
Cc: Greenseid, Joseph M.; torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
---RESOLVED

Good news...hard mounting the nfs dirs on the compute nodes worked.  There
were a couple of glitches along the way, but I got it to work.
I had to turn off autofs on all nodes.

Thanks to everyone for your input/suggestions.
Mary Ellen
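
For anyone who would rather keep autofs, the timing theory raised earlier
in this thread can be checked by hand from a shell on the node.  Below is a
minimal sketch, not something that was tested in this thread: wait_for_dir
is a made-up helper name, and the default retry count and delay are
arbitrary assumptions.  It retries a cd into the directory (the access is
what asks the automounter to mount it) and reports how many attempts it took.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or autofs): retry a chdir into a
# directory until it succeeds or the attempts run out.
# Usage: wait_for_dir DIR [TRIES] [DELAY_SECONDS]
wait_for_dir() {
    dir=$1
    tries=${2:-5}     # assumed default: 5 attempts
    delay=${3:-1}     # assumed default: 1 second between attempts
    i=1
    while [ "$i" -le "$tries" ]; do
        # cd in a subshell so the caller's working directory is untouched;
        # the access itself is what triggers the automount.
        if (cd "$dir") 2>/dev/null; then
            echo "available after $i attempt(s): $dir"
            return 0
        fi
        # Only pause if another attempt is coming.
        [ "$i" -lt "$tries" ] && sleep "$delay"
        i=$((i + 1))
    done
    echo "still unavailable after $tries attempt(s): $dir" >&2
    return 1
}
```

For example, "wait_for_dir /fs/userB1/mfitzpat 10 1" right after the
automount has timed out.  Note that doing the equivalent inside the job
script was reported as "no dice" in this thread, presumably because the
chdir that fails is done by pbs_mom before the script runs, so this is only
a way to look at the timing, not a fix.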

Mary Ellen Fitzpatrick wrote:
> Thanks.  I am testing the hard nfs mounts on the compute nodes now and 
> will, hopefully, post my successful results.
>
> Garrick wrote:
>   
>> You *want* the nodes to hang while nfs server is rebooting.  The 
>> alternative is to have all apps exit.
>>
>> HPCC/Linux Systems Admin
>>
>> On Oct 23, 2008, at 7:13 AM, Mary Ellen Fitzpatrick <mfitzpat at bu.edu>
>> wrote:
>>
>>     
>>> The reason I do not want to hard mount is that if I reboot my nfs 
>>> server, the nodes will hang, as all of them will be hard mounted 
>>> to that server. And I believe it is less load on the nfs server if 
>>> automount serves only the requested dirs instead of all nfs dirs.
>>>
>>> Greenseid, Joseph M. wrote:
>>>       
>>>> Why don't you want to hard mount NFS directories on the compute 
>>>> nodes?  What problem is this going to cause you?
>>>> --Joe
>>>>
>>>> ________________________________
>>>>
>>>> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
>>>> Sent: Wed 10/22/2008 3:20 PM
>>>> To: Greenseid, Joseph M.
>>>> Cc: Luke Scharf; torqueusers at supercluster.org
>>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted 
>>>> dirs
>>>>
>>>>
>>>>
>>>> Good thought... How would I slow down pbs_mom?  I tried putting a 
>>>> sleep command in my pbs script and, as Luke suggested, "ls $HOME"; no
>>>> dice.  Do I need to edit the pbs_mom daemon?
>>>>
>>>> I guess another hack would be to mount (via: cd nfsdir) the nfs 
>>>> dirs on the compute nodes, but then after the automounter timed out
>>>> or a reboot, I would be in the same situation.  Or to hard mount the 
>>>> nfs dirs (do not want to do this!!)
>>>>
>>>> Appreciate your help.
>>>>
>>>> Greenseid, Joseph M. wrote:
>>>>
>>>>         
>>>>> I don't have a real useful suggestion, but just a thought -- could
>>>>> it simply be a timing issue in that pbs_mom is trying to stat a 
>>>>> file or the directory before it's been fully mounted?  It may take
>>>>> a second to get the directory mounted if it wasn't already, and 
>>>>> maybe PBS is too fast for the auto-mounter, esp if the NFS server 
>>>>> is under some sort of load and could be taking a little longer to 
>>>>> respond than normal?
>>>>>
>>>>> --Joe
>>>>>
>>>>> ________________________________
>>>>>
>>>>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen
>>>>> Fitzpatrick
>>>>> Sent: Wed 10/22/2008 2:53 PM
>>>>> To: Luke Scharf
>>>>> Cc: torqueusers at supercluster.org
>>>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted 
>>>>> dirs
>>>>>
>>>>>
>>>>>
>>>>> The node OS is CentOS5 as is the nfs server.  The pbs server is 
>>>>> running CentOS4.5.  I have rebooted and chanted... :-) :-(
>>>>>
>>>>> Here is my simple pbs script and it does not have absolute paths.
>>>>> The script will run only after the nfs dirs are somehow mounted on
>>>>> the node.  I have tried it with absolute path names, and it makes 
>>>>> no difference.
>>>>>
>>>>> pbs script:
>>>>> #PBS -l nodes=node1048
>>>>> # join stderr and stdout and write them to a file
>>>>> #PBS -j oe
>>>>> #PBS -o test3.o
>>>>> # cd into the working directory
>>>>> cd $PBS_O_WORKDIR
>>>>> # print out some diagnostic stuff
>>>>> echo Running on host `hostname`
>>>>> echo Directory is `pwd`
>>>>> echo Start time is `date`
>>>>> # run my commands
>>>>>
>>>>> ./dostuff2.pl data.txt > test3.out1
>>>>>
>>>>> # print out some diagnostic stuff
>>>>> echo Stop time is `date`
>>>>>
>>>>>
>>>>>
>>>>> Luke Scharf wrote:
>>>>>
>>>>>           
>>>>>> If it works with the shell, however, the problem almost has to be
>>>>>> with something other than the automounter.
>>>>>>
>>>>>> Are any absolute paths in the qsub script correct?
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>>
>>>>>> Luke Scharf wrote:
>>>>>>
>>>>>>             
>>>>>>> That looks happy, too.
>>>>>>>
>>>>>>> What is the underlying OS running on the node?
>>>>>>>
>>>>>>> Have you tried just rebooting everything while muttering laments
>>>>>>> about stray alpha-particles to everyone within earshot?
>>>>>>>
>>>>>>> -Luke
>>>>>>>
>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>
>>>>>>>               
>>>>>>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
>>>>>>>> it seems like autofs is working correctly.  But unless the nfs dir
>>>>>>>> is pre-mounted, pbs_mom can not find it.  Very strange...
>>>>>>>>
>>>>>>>> Yes, getent passwd gives the correct home dir info
>>>>>>>> [root at node1048 mom_priv]# getent passwd 
>>>>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>>>>
>>>>>>>> Luke Scharf wrote:
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does the "getent passwd" information for the user have a 
>>>>>>>>> correct home directory on the node?
>>>>>>>>>
>>>>>>>>> -Luke
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Thanks Luke.
>>>>>>>>>> Right now, my cluster is one node, with an additional 50 to be 
>>>>>>>>>> brought on-line once I resolve the automount problem.  The 
>>>>>>>>>> job I am running is very simple, no nfs load on server.
>>>>>>>>>>
>>>>>>>>>> my $usecp I believe is correct and works properly after the 
>>>>>>>>>> nfs dir is mounted.
>>>>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>>>>
>>>>>>>>>> My auto.home file:
>>>>>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>>>>>
>>>>>>>>>> auto.master file:
>>>>>>>>>> #+auto.master
>>>>>>>>>> /fs     /etc/auto.home
>>>>>>>>>>
>>>>>>>>>> I believe it is an automount issue and I need to tweak a 
>>>>>>>>>> parameter in a config file.  Not sure which one it is at this
>>>>>>>>>> point.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Luke Scharf wrote:
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>>>> I have my home dirs nfs exported to all of my compute 
>>>>>>>>>>>> nodes.  I can log into the nodes and cd to the nfs mounted
>>>>>>>>>>>> dirs, no problem.
>>>>>>>>>>>> When I submit a job to a node and the automounted nfs dirs 
>>>>>>>>>>>> are not mounted (timed out), I get the following error:
>>>>>>>>>>>>
>>>>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory
>>>>>>>>>>>> (2) in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>>>
>>>>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to
>>>>>>>>>>>> be already mounted, then the job will run.  If I hard mount
>>>>>>>>>>>> the nfs home dirs, I have no problem running the jobs, but 
>>>>>>>>>>>> I do not want to do that.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone run into this?  Trying to figure out if it is a 
>>>>>>>>>>>> torque issue or automount issue.
>>>>>>>>>>>>
>>>>>>>>>>>>                         
>>>>>>>>>>> How big is your cluster?  How capable is the NFS server?  A 
>>>>>>>>>>> job-start is likely to create a mountstorm, and generate a 
>>>>>>>>>>> lot of I/O.  Some servers can handle it, some can't.
>>>>>>>>>>>
>>>>>>>>>>> Yay for scaling issues!
>>>>>>>>>>>
>>>>>>>>>>> -Luke
>>>>>>>>>>>
>>>>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>> --
>>>>> Thanks
>>>>> Mary Ellen
>>>>>
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Thanks
Mary Ellen
