[torqueusers] pbs_mom unable to chdir to automounted dirs

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Thu Oct 23 06:52:49 MDT 2008


Why don't you want to hard mount NFS directories on the compute nodes?  What problem is this going to cause you?
 
--Joe

________________________________

From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
Sent: Wed 10/22/2008 3:20 PM
To: Greenseid, Joseph M.
Cc: Luke Scharf; torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs



Good thought... How would I slow down pbs_mom?  I tried putting a sleep
command in my pbs script and, as Luke suggested, "ls $HOME"; no dice.   Do
I need to edit the pbs_mom daemon?

I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on
the compute nodes, but then after the automounter timed out or the node
rebooted, I would be in the same situation.  Or to hard mount the nfs dirs
(do not want to do this!!)
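
One way to test the timing theory outside of pbs_mom is a small retry loop
that pokes the directory until the automounter has mounted it.  A minimal
sketch, assuming a POSIX shell: the function name wait_for_dir, the retry
count, and the delay are made up for illustration, and whether pbs_mom can be
made to run something like this before its own chdir (for example from a
prologue script) is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE or autofs): poke a directory
# until it can be listed, which is what triggers the automounter.
# Usage: wait_for_dir DIR [TRIES] [DELAY_SECONDS]
wait_for_dir() {
    dir=$1
    tries=${2:-5}     # assumed default: 5 attempts
    delay=${3:-1}     # assumed default: 1 second between attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        # "$dir/." only resolves if $dir exists and is a directory.
        if ls "$dir/." >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "wait_for_dir: $dir still not available after $tries tries" >&2
    return 1
}

# Example call (commented out so the sketch has no side effects):
# wait_for_dir /fs/userB1/mfitzpat 10 1 || exit 1
```

If the directory shows up after a retry or two, that points at a race with the
automounter rather than a wrong path.  Note that the error quoted in this
thread is reported from TMomFinalizeChild, which suggests the chdir fails
before the job script starts running; that would explain why a sleep inside
the script itself makes no difference.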

Appreciate your help.

Greenseid, Joseph M. wrote:
> I don't have a real useful suggestion, but just a thought -- could it simply be a timing issue in that pbs_mom is trying to stat a file or the directory before it's been fully mounted?  It may take a second to get the directory mounted if it wasn't already, and maybe PBS is too fast for the auto-mounter, esp if the NFS server is under some sort of load and could be taking a little longer to respond than normal?
> 
> --Joe
>
> ________________________________
>
> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
> Sent: Wed 10/22/2008 2:53 PM
> To: Luke Scharf
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>
>
>
> The node OS is CentOS5 as is the nfs server.  The pbs server is running
> CentOS4.5.  I have rebooted and chanted... :-) :-(
>
> Here is my simple pbs script and it does not have absolute paths.  The
> script will run only after the nfs dirs are somehow mounted on the
> node.  I have tried it with absolute path names, and it makes no difference.
>
> pbs script:
> #PBS -l nodes=node1048
> # join stderr and stdout and write them to a file
> #PBS -j oe
> #PBS -o test3.o
>       
> # cd into the working directory
> cd "$PBS_O_WORKDIR"
> # print out some diagnostic stuff
> echo Running on host `hostname`
> echo Directory is `pwd`
> echo Start time is `date`
> # run my commands
>
> ./dostuff2.pl data.txt > test3.out1
>
> # print out some diagnostic stuff
> echo Stop time is `date`
>
>
>
> Luke Scharf wrote:
>  
>> If it works with the shell, however, the problem almost has to be with
>> something other than the automounter.
>>
>> Are any absolute paths in the qsub script correct?
>>
>> -Luke
>>
>>
>> Luke Scharf wrote:
>>    
>>> That looks happy, too.
>>>
>>> What is the underlying OS running on the node?
>>>
>>> Have you tried just rebooting everything while muttering laments
>>> about stray alpha-particles to everyone within earshot?
>>>
>>> -Luke
>>>
>>> Mary Ellen Fitzpatrick wrote:
>>>      
>>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
>>>> seems like autofs is working correctly.  But unless the nfs dir is
>>>> pre-mounted, pbs_mom can not find it.  Very strange...
>>>>
>>>> Yes, getent passwd gives the correct home dir info
>>>> [root at node1048 mom_priv]# getent passwd
>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>
>>>> Luke Scharf wrote:
>>>>        
>>>>> Nothing that you mention looks amiss at first glance...
>>>>>
>>>>>
>>>>> Does the "getent passwd" information for the user have a correct
>>>>> home directory on the node?
>>>>>
>>>>> -Luke
>>>>>
>>>>>
>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>          
>>>>>> Thanks Luke.
>>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>>> am running is very simple, no nfs load on server.
>>>>>>
>>>>>> my $usecp I believe is correct and works properly after the nfs
>>>>>> dir is mounted.
>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>
>>>>>> My auto.home file:
>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>
>>>>>> auto.master file:
>>>>>> #+auto.master
>>>>>> /fs     /etc/auto.home
>>>>>>
>>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>>> in a config file.  Not sure which one it is at this point.
>>>>>>
>>>>>>
>>>>>> Luke Scharf wrote:
>>>>>>            
>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>              
>>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>
>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>> failed: No such file or directory
>>>>>>>>
>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
>>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>>> to do that.
>>>>>>>>
>>>>>>>> Anyone run into this?  Trying to figure out if it is a torque
>>>>>>>> issue or automount issue.
>>>>>>>>                
>>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>>> job-start is likely to create a mountstorm, and generate a lot of
>>>>>>> I/O.  Some servers can handle it, some can't.
>>>>>>>
>>>>>>> Yay for scaling issues!
>>>>>>>
>>>>>>> -Luke
>>>>>>>
>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>              
>
> --
> Thanks
> Mary Ellen
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>  

--
Thanks
Mary Ellen


