[torqueusers] pbs_mom unable to chdir to automounted dirs
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Thu Oct 23 08:13:00 MDT 2008
The reason I do not want to hard mount is that if I reboot my nfs
server, the nodes will hang, as all of them will be hard mounted to
that server. And I believe it is less load on the nfs server if
automount serves only the dirs requested instead of all nfs dirs.
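
For anyone wanting to test the timing theory quoted below while staying
with automount, here is a minimal sketch of a wait-and-retry check.
wait_for_dir is a hypothetical helper name, not part of TORQUE or autofs,
and the attempt count and delay are arbitrary defaults. The sleep and
"ls $HOME" inside the job script did not help, presumably because
pbs_mom's chdir fails before the script ever runs, so a check like this
would have to run earlier than the job script to matter.

```shell
#!/bin/sh
# Sketch only: wait until an automounted directory becomes visible.
# wait_for_dir is a hypothetical helper, not a TORQUE or autofs command.
# Usage: wait_for_dir DIR [ATTEMPTS] [DELAY_SECONDS]
wait_for_dir() {
    dir=$1
    attempts=${2:-5}   # assumed default: 5 tries
    delay=${3:-1}      # assumed default: 1 second between tries
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Looking up the path is what should prompt autofs to mount it.
        if [ -d "$dir" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    # Directory never appeared within the allowed attempts.
    return 1
}
```

One place to try it would be a mom prologue script, which according to
the TORQUE docs is passed the job owner's user name as its second
argument, e.g. wait_for_dir "$(getent passwd "$2" | cut -d: -f6)".
Whether the prologue runs before pbs_mom does its own chdir would need
to be verified.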
Greenseid, Joseph M. wrote:
> Why don't you want to hard mount NFS directories on the compute nodes? What problem is this going to cause you?
>
> --Joe
>
> ________________________________
>
> From: Mary Ellen Fitzpatrick [mailto:mfitzpat at bu.edu]
> Sent: Wed 10/22/2008 3:20 PM
> To: Greenseid, Joseph M.
> Cc: Luke Scharf; torqueusers at supercluster.org
> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>
>
>
> Good thought... How would I slow down pbs_mom? I tried putting a sleep
> command in my pbs script and, as Luke suggested, "ls $HOME"; no dice. Do
> I need to edit the pbs_mom daemon?
>
> I guess another hack would be to mount (via: cd nfsdir) the nfs dirs on
> the compute nodes, but then after the automounter timed out or a node
> rebooted, I would be in the same situation. Or to hard mount the nfs dirs
> (do not want to do this!!)
>
> Appreciate your help.
>
> Greenseid, Joseph M. wrote:
>
>> I don't have a really useful suggestion, but just a thought: could it
>> simply be a timing issue, in that pbs_mom is trying to stat a file or the
>> directory before it's been fully mounted? It may take a second to get the
>> directory mounted if it wasn't already, and maybe PBS is too fast for the
>> automounter, especially if the NFS server is under some sort of load and
>> is taking a little longer to respond than normal?
>>
>> --Joe
>>
>> ________________________________
>>
>> From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
>> Sent: Wed 10/22/2008 2:53 PM
>> To: Luke Scharf
>> Cc: torqueusers at supercluster.org
>> Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs
>>
>>
>>
>> The node OS is CentOS5 as is the nfs server. The pbs server is running
>> CentOS4.5. I have rebooted and chanted... :-) :-(
>>
>> Here is my simple pbs script and it does not have absolute paths. The
>> script will run only after the nfs dirs are somehow mounted on the
>> node. I have tried it with absolute path names, and it makes no difference.
>>
>> pbs script:
>> #PBS -l nodes=node1048
>> # join stderr and stdout and write them to a file
>> #PBS -j oe
>> #PBS -o test3.o
>>
>> # cd into the working directory
>> cd $PBS_O_WORKDIR
>> # print out some diagnostic stuff
>> echo Running on host `hostname`
>> echo Directory is `pwd`
>> echo Start time is `date`
>> # run my commands
>>
>> ./dostuff2.pl data.txt > test3.out1
>>
>> # print out some diagnostic stuff
>> echo Stop time is `date`
>>
>>
>>
>> Luke Scharf wrote:
>>
>>
>>> If it works with the shell, however, the problem almost has to be with
>>> something other than the automounter.
>>>
>>> Are any absolute paths in the qsub script correct?
>>>
>>> -Luke
>>>
>>>
>>> Luke Scharf wrote:
>>>
>>>
>>>> That looks happy, too.
>>>>
>>>> What is the underlying OS running on the node?
>>>>
>>>> Have you tried just rebooting everything while muttering laments
>>>> about stray alpha-particles to everyone within earshot?
>>>>
>>>> -Luke
>>>>
>>>> Mary Ellen Fitzpatrick wrote:
>>>>
>>>>
>>>>> Yeah, that is why I am stumped... because I can cd to nfs dirs,
>>>>> seems like autofs is working correctly. But unless the nfs dir is
>>>>> pre-mounted, pbs_mom can not find it. Very strange...
>>>>>
>>>>> Yes, getent passwd gives the correct home dir info
>>>>> [root at node1048 mom_priv]# getent passwd
>>>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>>>
>>>>> Luke Scharf wrote:
>>>>>
>>>>>
>>>>>> Nothing that you mention looks amiss at first glance...
>>>>>>
>>>>>>
>>>>>> Does the "getent passwd" information for the user have a correct
>>>>>> home directory on the node?
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>>
>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>
>>>>>>
>>>>>>> Thanks Luke.
>>>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>>>> brought on-line once I resolve the automount problem. The job I
>>>>>>> am running is very simple, no nfs load on server.
>>>>>>>
>>>>>>> my $usecp I believe is correct and works properly after the nfs
>>>>>>> dir is mounted.
>>>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>>>
>>>>>>> My auto.home file:
>>>>>>> userB1 -rw,hard,intr userB:/userB/u1
>>>>>>>
>>>>>>> auto.master file:
>>>>>>> #+auto.master
>>>>>>> /fs /etc/auto.home
>>>>>>>
>>>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>>>> in a config file. Not sure which one it is at this point.
>>>>>>>
>>>>>>>
>>>>>>> Luke Scharf wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> I have my home dirs nfs exported to all of my compute nodes. I
>>>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
>>>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>>>
>>>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>>>> failed: No such file or directory
>>>>>>>>>
>>>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>>>> run. It appears that pbs wants the automounted nfs dirs to be
>>>>>>>>> already mounted, then the job will run. If I hard mount the nfs
>>>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>>>> to do that.
>>>>>>>>>
>>>>>>>>> Anyone run into this? Trying to figure out if it is a torque
>>>>>>>>> issue or an automount issue.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> How big is your cluster? How capable is the NFS server? A
>>>>>>>> job-start is likely to create a mountstorm, and generate a lot of
>>>>>>>> I/O. Some servers can handle it, some can't.
>>>>>>>>
>>>>>>>> Yay for scaling issues!
>>>>>>>>
>>>>>>>> -Luke
>>>>>>>>
>>>>>>>> P.S. I second the suggestion of checking the $usecp value.
>>>>>>>>
>>>>>>>>
>> --
>> Thanks
>> Mary Ellen
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>
> --
> Thanks
> Mary Ellen
>
--
Thanks
Mary Ellen