[torqueusers] pbs_mom unable to chdir to automounted dirs

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Wed Oct 22 12:56:40 MDT 2008


I don't have a really useful suggestion, just a thought: could it simply be a timing issue, with pbs_mom trying to stat a file or the directory before it has been fully mounted?  It may take a second to get the directory mounted if it wasn't already, and maybe PBS is too fast for the automounter, especially if the NFS server is under load and taking a little longer than normal to respond.
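
One way to test the timing theory on the node is to see how many attempts it takes before the directory becomes accessible. This is only a sketch, assuming a POSIX shell; wait_for_dir and its default values are hypothetical names made up for illustration and are not part of TORQUE or autofs.

```shell
# Hypothetical helper (not part of TORQUE or autofs): poll a directory
# until it is accessible. Each test touches the path, which should give
# the automounter a chance to mount it.
wait_for_dir() {
    dir=$1
    tries=${2:-5}      # number of attempts (assumed default)
    delay=${3:-1}      # seconds to sleep between attempts (assumed default)
    i=1
    while [ "$i" -le "$tries" ]; do
        if [ -d "$dir/." ]; then
            echo "accessible after $i attempt(s): $dir"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still not accessible after $tries attempt(s): $dir" >&2
    return 1
}
```

Run as root on the node after the mount has timed out, for example as `wait_for_dir /fs/userB1/mfitzpat 10 1`. If the first attempt fails and a later one succeeds, that would point toward a race between pbs_mom and the automounter.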
 
--Joe

________________________________

From: torqueusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
Sent: Wed 10/22/2008 2:53 PM
To: Luke Scharf
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_mom unable to chdir to automounted dirs



The node OS is CentOS5 as is the nfs server.  The pbs server is running
CentOS4.5.  I have rebooted and chanted... :-) :-(

Here is my simple pbs script and it does not have absolute paths.  The
script will run only after the nfs dirs are somehow mounted on the
node.  I have tried it with absolute path names, and it makes no difference.

pbs script:
#PBS -l nodes=node1048
# join stderr and stdout and write them to a file
#PBS -j oe
#PBS -o test3.o
       
# cd into the working directory
cd $PBS_O_WORKDIR
# print out some diagnostic stuff
echo Running on host `hostname`
echo Directory is `pwd`
echo Start time is `date`
# run my commands

./dostuff2.pl data.txt > test3.out1

# print out some diagnostic stuff
echo Stop time is `date`
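
To check whether a failure lines up with the mount state, the node's mount table can be inspected just before submitting. This is a rough sketch, assuming a Linux node where /proc/mounts lists active mounts; is_mounted is a hypothetical helper name, not a TORQUE or autofs command.

```shell
# Hypothetical helper: report whether a path is currently a mount point,
# judged by the second field of a mounts table (default: /proc/mounts).
is_mounted() {
    path=$1
    table=${2:-/proc/mounts}
    # Exit status 0 if any line has the path as its mount point, else 1.
    awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$table"
}
```

For example, `is_mounted /fs/userB1 && echo mounted || echo "not mounted"` run on the node right before qsub would show whether the job is about to hit an unmounted directory.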



Luke Scharf wrote:
> If it works with the shell, however, the problem almost has to be with
> something other than the automounter.
>
> Are any absolute paths in the qsub script correct?
>
> -Luke
>
>
> Luke Scharf wrote:
>> That looks happy, too.
>>
>> What is the underlying OS running on the node?
>>
>> Have you tried just rebooting everything while muttering laments
>> about stray alpha-particles to everyone within earshot?
>>
>> -Luke
>>
>> Mary Ellen Fitzpatrick wrote:
>>> Yeah, that is why I am stumped...   because I can cd to nfs dirs,
>>> seems like autofs is working correctly.  But unless the nfs dir is
>>> pre-mounted, pbs_mom can not find it.  Very strange...
>>>
>>> Yes, getent passwd gives the correct home dir info
>>> [root at node1048 mom_priv]# getent passwd
>>> mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
>>>
>>> Luke Scharf wrote:
>>>> Nothing that you mention looks amiss at first glance...
>>>>
>>>>
>>>> Does the "getent passwd" information for the user have a correct
>>>> home directory on the node?
>>>>
>>>> -Luke
>>>>
>>>>
>>>> Mary Ellen Fitzpatrick wrote:
>>>>> Thanks Luke.
>>>>> Right now, my cluster is one node, with an additional 50 to be
>>>>> brought on-line once I resolve the automount problem.  The job I
>>>>> am running is very simple, with no nfs load on the server.
>>>>>
>>>>> my $usecp I believe is correct and works properly after the nfs
>>>>> dir is mounted.
>>>>> $usecp *:/fs/userB1 /fs/userB1
>>>>>
>>>>> My auto.home file:
>>>>> userB1  -rw,hard,intr   userB:/userB/u1
>>>>>
>>>>> auto.master file:
>>>>> #+auto.master
>>>>> /fs     /etc/auto.home
>>>>>
>>>>> I believe it is an automount issue and I need to tweak a parameter
>>>>> in a config file.  Not sure which one it is at this point.
>>>>>
>>>>>
>>>>> Luke Scharf wrote:
>>>>>> Mary Ellen Fitzpatrick wrote:
>>>>>>> I have my home dirs nfs exported to all of my compute nodes.  I
>>>>>>> can log into the nodes and cd to the nfs mounted dirs, no problem.
>>>>>>> When I submit a job to a node and the automounted nfs dirs are
>>>>>>> not mounted (timed out), I get the following error:
>>>>>>>
>>>>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2)
>>>>>>> in TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat'
>>>>>>> failed: No such file or directory
>>>>>>>
>>>>>>> If I immediately resubmit the job to the same node, it will
>>>>>>> run.  It appears that pbs wants the automounted nfs dirs to be
>>>>>>> already mounted, then the job will run.  If I hard mount the nfs
>>>>>>> home dirs, I have no problem running the jobs, but I do not want
>>>>>>> to do that.
>>>>>>>
>>>>>>> Has anyone run into this?  Trying to figure out if it is a torque
>>>>>>> issue or automount issue.
>>>>>>
>>>>>> How big is your cluster?  How capable is the NFS server?  A
>>>>>> job-start is likely to create a mountstorm, and generate a lot of
>>>>>> I/O.  Some servers can handle it, some can't.
>>>>>>
>>>>>> Yay for scaling issues!
>>>>>>
>>>>>> -Luke
>>>>>>
>>>>>> P.S. I second the suggestion of checking the $usecp value.

--
Thanks
Mary Ellen


