[torqueusers] pbs_mom unable to chdir to automounted dirs

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Thu Oct 23 09:31:06 MDT 2008


Thanks.  I am testing the hard nfs mounts on the compute nodes now and 
will, hopefully, post my successful results.
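
For reference, a rough sketch of how the auto.home entry quoted further 
down could be turned into a static /etc/fstab line for testing.  The 
helper name map_to_fstab is made up for illustration, and the "nfs" type 
and "0 0" fields are assumptions on my part, not something confirmed in 
this thread.

```shell
# Sketch only: convert indirect autofs map entries into /etc/fstab lines.
# $1 = mount root from auto.master (here /fs); map entries arrive on stdin.
# map_to_fstab is a hypothetical helper, not part of autofs or TORQUE.
map_to_fstab() {
    root=$1
    awk -v root="$root" '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        {
            opts = $2
            sub(/^-/, "", opts)              # autofs options carry a leading "-"
            # device, mount point, fs type (assumed nfs), options, dump, pass
            printf "%s\t%s/%s\tnfs\t%s\t0 0\n", $3, root, $1, opts
        }'
}

# Feed it the auto.home entry from this thread.
echo 'userB1  -rw,hard,intr   userB:/userB/u1' | map_to_fstab /fs
```

The output is meant to be reviewed by hand before it goes anywhere near 
/etc/fstab on a node.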

Garrick wrote:
>
> You *want* the nodes to hang while the nfs server is rebooting.  The 
> alternative is to have all apps exit.
>
> HPCC/Linux Systems Admin
>
> On Oct 23, 2008, at 7:13 AM, Mary Ellen Fitzpatrick <mfitzpat at bu.edu> 
> wrote:
>
>> The reason I do not want to hard mount is that if I reboot my nfs 
>> server, the nodes will hang, as all of them will be hard mounted to 
>> that server. And I believe it puts less load on the nfs server if 
>> automount serves only the dirs requested instead of all nfs dirs.
>>
>> Greenseid, Joseph M. wrote:
>>> Why don't you want to hard mount NFS directories on the compute 
>>> nodes?  What problem is this going to cause you?
>>> --Joe
>>>
>>> ________________________________
>>>
>>> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
>>> Sent: Wed 10/22/2008 3:20 PM
>>> To: Greenseid, Joseph M.
>>> Cc: Luke Scharf; torqueusers at supercluster.org
>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>>
>>>
>>>
>>> Good thought... How would I slow down pbs_mom?  I tried putting a sleep
>>> command in my pbs script and, as Luke suggested, "ls $HOME"; no dice.
>>> Do I need to edit the pbs_mom daemon?
>>>
>>> I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on
>>> the compute nodes, but then after the automounter timed out or a reboot,
>>> I would be in the same situation.  Or to hard mount the nfs dirs (do not
>>> want to do this!!)
>>>
>>> Appreciate your help.
>>>
>>> Greenseid, Joseph M. wrote:
>>>
>>>> I don't have a real useful suggestion, but just a thought -- could 
>>>> it simply be a timing issue in that pbs_mom is trying to stat a 
>>>> file or the directory before it's been fully mounted?  It may take 
>>>> a second to get the directory mounted if it wasn't already, and 
>>>> maybe PBS is too fast for the auto-mounter, esp if the NFS server 
>>>> is under some sort of load and could be taking a little longer to 
>>>> respond than normal?
>>>>
>>>> --Joe
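
One way to check the timing idea from a shell on the node is to poll the 
directory and see whether the automounter needs more than one try.  This 
is only a sketch: wait_for_dir, its retry count, and its delay are 
hypothetical, and it does not change when pbs_mom itself does the chdir.

```shell
# Sketch only: poll a directory until it can be listed, or give up.
# wait_for_dir DIR [TRIES] [DELAY]
# TRIES and DELAY defaults are illustrative, not values from this thread.
wait_for_dir() {
    dir=$1
    tries=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Listing the directory is what asks the automounter to mount it.
        if ls "$dir" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, "wait_for_dir /fs/userB1/mfitzpat 10 1" would keep trying 
for roughly ten seconds and return non-zero if the directory never shows up.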
>>>>
>>>> ________________________________
>>>>
>>>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen 
>>>> Fitzpatrick
>>>> Sent: Wed 10/22/2008 2:53 PM
>>>> To: Luke Scharf
>>>> Cc: torqueusers at supercluster.org
>>>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>>>
>>>>
>>>>
>>>> The node OS is CentOS5 as is the nfs server.  The pbs server is running
>>>> CentOS4.5.  I have rebooted and chanted... :-) :-(
>>>>
>>>> Here is my simple pbs script and it does not have absolute paths.  The
>>>> script will run only after the nfs dirs are somehow mounted on the
>>>> node.  I have tried it with absolute path names, and it makes no 
>>>> difference.
>>>>
>>>> pbs script:
>>>> #PBS -l nodes=node1048
>>>> # join stderr and stdout and write them to a file
>>>> #PBS -j oe
>>>> #PBS -o test3.o
>>>> # cd into the working directory
>>>> cd $PBS_O_WORKDIR
>>>> # print out some diagnostic stuff
>>>> echo Running on host `hostname`
>>>> echo Directory is `pwd`
>>>> echo Start time is `date`
>>>> # run my commands
>>>>
>>>> ./dostuff2.pl data.txt > test3.out1
>>>>
>>>> # print out some diagnostic stuff
>>>> echo Stop time is `date`
>>>>
>>>>
>>>>
>>>> Luke Scharf wrote:
>>>>
>>>>> If it works with the shell, however, the problem almost has to be with
>>>>> something other than the automounter.
>>>>>
>>>>> Are any absolute paths in the qsub script correct?
>>>>>
>>>>> -Luke
>>>>>
>>>>>
>>>>> Luke Scharf wrote:
>>>>>
>>>>>> That looks happy, too.
>>>>>>
>>>>>> What is the underlying OS running on the node?
>>>>>>
>>>>>> Have you tried just rebooting everything while muttering laments
>>>>>> about stray alpha-particles to everyone within earshot?
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>
>>>>>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
>>>>>>> seems like autofs is working correctly.  But unless the nfs dir is
>>>>>>> pre-mounted, pbs_mom can not find it.  Very strange...
>>>>>>>
>>>>>>> Yes, getent passwd gives the correct home dir info
>>>>>>> [root at node1048 mom_priv]# getent passwd
>>>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>>>
>>>>>>> Luke Scharf wrote:
>>>>>>>
>>>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>>>
>>>>>>>>
>>>>>>>> Does the "getent passwd" information for the user have a correct
>>>>>>>> home directory on the node?
>>>>>>>>
>>>>>>>> -Luke
>>>>>>>>
>>>>>>>>
>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>
>>>>>>>>> Thanks Luke.
>>>>>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>>>>>> am running is very simple, no nfs load on server.
>>>>>>>>>
>>>>>>>>> my $usecp I believe is correct and works properly after the nfs
>>>>>>>>> dir is mounted.
>>>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>>>
>>>>>>>>> My auto.home file:
>>>>>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>>>>>
>>>>>>>>> auto.master file:
>>>>>>>>> #+auto.master
>>>>>>>>> /fs     /etc/auto.home
>>>>>>>>>
>>>>>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>>>>>> in a config file.  Not sure which one it is at this point.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Luke Scharf wrote:
>>>>>>>>>
>>>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>>>
>>>>>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>>>>>> can log into the nodes and cd the nfs mounted dirs, no problem.
>>>>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>>>>
>>>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>>>> failed: No such file or directory
>>>>>>>>>>>
>>>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
>>>>>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>>>>>> to do that.
>>>>>>>>>>>
>>>>>>>>>>> Anyone run into this?  Trying to figure out if it is a torque
>>>>>>>>>>> issue or automount issue.
>>>>>>>>>>>
>>>>>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>>>>>> job-start is likely to create a mountstorm, and generate a lot of
>>>>>>>>>> I/O.  Some servers can handle it, some can't.
>>>>>>>>>>
>>>>>>>>>> Yay for scaling issues!
>>>>>>>>>>
>>>>>>>>>> -Luke
>>>>>>>>>>
>>>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>>>>
>>>> -- 
>>>> Thanks
>>>> Mary Ellen
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
Thanks
Mary Ellen


