[torqueusers] pbs_mom unable to chdir to automounted dirs --- RESOLVED

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Thu Oct 23 13:31:45 MDT 2008


mary ellen,
 
glad to hear your environment is working.  
 
another random thought -- when you were using soft mounts, how many nfs daemons were you running?  did you up the number of daemons from the default to see if that caused the server to respond faster?
 
--Joe
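
For anyone following up on the daemon-count question, here is a minimal sketch. It assumes a Red Hat style NFS server where the number of nfsd threads is set with RPCNFSDCOUNT in /etc/sysconfig/nfs, and that 8 is the fallback when the variable is not set. The function name nfsd_count is made up for illustration and is not part of any NFS tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the nfsd thread count configured in a
# sysconfig-style file. Falls back to 8 when the variable is unset,
# commented out, or the file is missing (8 is an assumed distro default).
nfsd_count() {
    conf_file="$1"
    # Take the last uncommented RPCNFSDCOUNT=<number> assignment, if any.
    count=$(sed -n 's/^[[:space:]]*RPCNFSDCOUNT=\([0-9][0-9]*\).*/\1/p' "$conf_file" 2>/dev/null | tail -n 1)
    echo "${count:-8}"
}

# Example use on the server (paths are assumptions about the install):
#   nfsd_count /etc/sysconfig/nfs
#   cat /proc/fs/nfsd/threads    # thread count of the running server
```

If the two numbers differ, the config change has probably not been picked up by a restart of the nfs service yet.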

________________________________

From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
Sent: Thu 10/23/2008 1:32 PM
To: Garrick
Cc: Greenseid, Joseph M.; torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs --- RESOLVED



Good news...hard mounting the nfs dirs on the compute nodes worked.  A
couple of glitches along the way, but I got it to work.
I had to turn off autofs on all nodes.

Thanks to everyone for your input/suggestions.
Mary Ellen
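
A rough sketch of what the switch from autofs to static mounts could look like, based on the auto.master and auto.home entries quoted further down in this thread. The helper name map_to_fstab is hypothetical, and the exact fstab line used on the cluster was not posted, so treat the output as an illustration only.

```shell
#!/bin/sh
# Hypothetical helper: turn one indirect autofs map entry
# ("key -options server:/export") into an /etc/fstab line for a static
# NFS mount under the base directory named in auto.master.
map_to_fstab() {
    base="$1"    # mount point prefix from auto.master, e.g. /fs
    key="$2"     # map key, e.g. userB1
    opts="$3"    # map options with the leading dash, e.g. -rw,hard,intr
    src="$4"     # server:/export, e.g. userB:/userB/u1
    # fstab options carry no leading dash, so strip it.
    printf '%s\t%s/%s\tnfs\t%s\t0 0\n' "$src" "$base" "$key" "${opts#-}"
}

# Entry from this thread:
#   map_to_fstab /fs userB1 -rw,hard,intr userB:/userB/u1
```

The resulting line would go into /etc/fstab on each compute node, with autofs turned off so the two do not compete for /fs.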

Mary Ellen Fitzpatrick wrote:
> Thanks.  I am testing the hard nfs mounts on the compute nodes now and
> will, hopefully, post my successful results.
>
> Garrick wrote:
>  
>> You *want* the nodes to hang while nfs server is rebooting.  The
>> alternative is to have all apps exit.
>>
>> HPCC/Linux Systems Admin
>>
>> On Oct 23, 2008, at 7:13 AM, Mary Ellen Fitzpatrick <mfitzpat at bu.edu>
>> wrote:
>>
>>    
>>> The reason I do not want to hard mount is that if I reboot my nfs
>>> server, the nodes will hang, as all of them will be hard mounted to
>>> that server. And I believe it puts less load on the nfs server if
>>> automount serves only the dirs requested instead of all nfs dirs.
>>>
>>> Greenseid, Joseph M. wrote:
>>>      
>>>> Why don't you want to hard mount NFS directories on the compute
>>>> nodes?  What problem is this going to cause you?
>>>> --Joe
>>>>
>>>> ________________________________
>>>>
>>>> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
>>>> Sent: Wed 10/22/2008 3:20 PM
>>>> To: Greenseid, Joseph M.
>>>> Cc: Luke Scharf; torqueusers at supercluster.org
>>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>>>
>>>>
>>>>
>>>> Good thought... How would I slow down pbs_mom?  I tried putting a sleep
>>>> command in my pbs script and, as Luke suggested, "ls $HOME"; no dice.
>>>> Do I need to edit the pbs_mom daemon?
>>>>
>>>> I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on
>>>> the compute nodes, but then after the automounter timed out or a
>>>> reboot, I would be in the same situation.  Or to hard mount the nfs
>>>> dirs (do not want to do this!!)
>>>>
>>>> Appreciate your help.
>>>>
>>>> Greenseid, Joseph M. wrote:
>>>>
>>>>        
>>>>> I don't have a real useful suggestion, but just a thought -- could
>>>>> it simply be a timing issue in that pbs_mom is trying to stat a
>>>>> file or the directory before it's been fully mounted?  It may take
>>>>> a second to get the directory mounted if it wasn't already, and
>>>>> maybe PBS is too fast for the auto-mounter, esp if the NFS server
>>>>> is under some sort of load and could be taking a little longer to
>>>>> respond than normal?
>>>>>
>>>>> --Joe
>>>>>
>>>>> ________________________________
>>>>>
>>>>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen
>>>>> Fitzpatrick
>>>>> Sent: Wed 10/22/2008 2:53 PM
>>>>> To: Luke Scharf
>>>>> Cc: torqueusers at supercluster.org
>>>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>>>>
>>>>>
>>>>>
>>>>> The node OS is CentOS5, as is the nfs server.  The pbs server is
>>>>> running CentOS4.5.  I have rebooted and chanted... :-) :-(
>>>>>
>>>>> Here is my simple pbs script and it does not have absolute paths.  The
>>>>> script will run only after the nfs dirs are somehow mounted on the
>>>>> node.  I have tried it with absolute path names, and it makes no
>>>>> difference.
>>>>>
>>>>> pbs script:
>>>>> #PBS -l nodes=node1048
>>>>> # join stderr and stdout and write them to a file
>>>>> #PBS -j oe
>>>>> #PBS -o test3.o
>>>>> # cd into the working directory
>>>>> cd $PBS_O_WORKDIR
>>>>> # print out some diagnostic stuff
>>>>> echo Running on host `hostname`
>>>>> echo Directory is `pwd`
>>>>> echo Start time is `date`
>>>>> # run my commands
>>>>>
>>>>> ./dostuff2.pl data.txt > test3.out1
>>>>>
>>>>> # print out some diagnostic stuff
>>>>> echo Stop time is `date`
>>>>>
>>>>>
>>>>>
>>>>> Luke Scharf wrote:
>>>>>
>>>>>          
>>>>>> If it works with the shell, however, the problem almost has to be
>>>>>> with something other than the automounter.
>>>>>>
>>>>>> Are any absolute paths in the qsub script correct?
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>>
>>>>>> Luke Scharf wrote:
>>>>>>
>>>>>>            
>>>>>>> That looks happy, too.
>>>>>>>
>>>>>>> What is the underlying OS running on the node?
>>>>>>>
>>>>>>> Have you tried just rebooting everything while muttering laments
>>>>>>> about stray alpha-particles to everyone within earshot?
>>>>>>>
>>>>>>> -Luke
>>>>>>>
>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>
>>>>>>>              
>>>>>>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
>>>>>>>> seems like autofs is working correctly.  But unless the nfs dir is
>>>>>>>> pre-mounted, pbs_mom can not find it.  Very strange...
>>>>>>>>
>>>>>>>> Yes, getent passwd gives the correct home dir info
>>>>>>>> [root at node1048 mom_priv]# getent passwd
>>>>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>>>>
>>>>>>>> Luke Scharf wrote:
>>>>>>>>
>>>>>>>>                
>>>>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does the "getent passwd" information for the user have a correct
>>>>>>>>> home directory on the node?
>>>>>>>>>
>>>>>>>>> -Luke
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>
>>>>>>>>>                  
>>>>>>>>>> Thanks Luke.
>>>>>>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>>>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>>>>>>> am running is very simple, no nfs load on server.
>>>>>>>>>>
>>>>>>>>>> my $usecp I believe is correct and works properly after the nfs
>>>>>>>>>> dir is mounted.
>>>>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>>>>
>>>>>>>>>> My auto.home file:
>>>>>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>>>>>
>>>>>>>>>> auto.master file:
>>>>>>>>>> #+auto.master
>>>>>>>>>> /fs     /etc/auto.home
>>>>>>>>>>
>>>>>>>>>> I believe it is an automount issue and I need to tweak a
>>>>>>>>>> parameter in a config file.  Not sure which one it is at this point.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Luke Scharf wrote:
>>>>>>>>>>
>>>>>>>>>>                    
>>>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>>>
>>>>>>>>>>>                      
>>>>>>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
>>>>>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>>>>>
>>>>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>>>
>>>>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>>>>>>> already mounted, then the job will run.  If I hard mount the
>>>>>>>>>>>> nfs home dirs, I have no problem running the jobs, but I do not
>>>>>>>>>>>> want to do that.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyone run into this?  Trying to figure out if it is a torque
>>>>>>>>>>>> issue or automount issue.
>>>>>>>>>>>>
>>>>>>>>>>>>                        
>>>>>>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>>>>>>> job-start is likely to create a mountstorm, and generate a
>>>>>>>>>>> lot of I/O.  Some servers can handle it, some can't.
>>>>>>>>>>>
>>>>>>>>>>> Yay for scaling issues!
>>>>>>>>>>>
>>>>>>>>>>> -Luke
>>>>>>>>>>>
>>>>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>>>>>
>>>>>>>>>>>                      
>>>>> --
>>>>> Thanks
>>>>> Mary Ellen
>>>>>
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>> --
>>>> Thanks
>>>> Mary Ellen
>>>>
>>> --
>>> Thanks
>>> Mary Ellen
>>>
>
>  

--
Thanks
Mary Ellen
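
On the timing theory raised in the quoted messages (pbs_mom being too fast for the automounter): a small sketch of a "touch the directory and wait" check. The function wait_for_dir and its defaults are hypothetical. Note that putting this in the job script is presumably too late, since the chdir fails before the script starts, which matches the report above that a sleep and "ls $HOME" in the script made no difference. Where it would have to run to help pbs_mom is not settled in this thread.

```shell
#!/bin/sh
# Hypothetical helper: poke a directory until it is usable or the retry
# budget runs out. Listing "$dir/." forces a lookup inside the mount
# point, which is what prompts autofs to mount an idle directory.
wait_for_dir() {
    dir="$1"
    tries="${2:-5}"     # number of attempts
    delay="${3:-1}"     # seconds to sleep between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ls "$dir/." >/dev/null 2>&1; then
            return 0    # directory is reachable
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1            # still not reachable after all attempts
}

# Example with the home directory from this thread:
#   wait_for_dir /fs/userB1/mfitzpat 5 1 || echo "home dir not available"
```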


