[torqueusers] pbs_mom unable to chdir to automounted dirs

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Wed Oct 22 12:37:07 MDT 2008


Yeah, that is why I am stumped...  because I can cd to the nfs dirs, it 
seems like autofs is working correctly.  But unless the nfs dir is 
already mounted, pbs_mom cannot find it.  Very strange...

Yes, getent passwd gives the correct home dir info:
[root@node1048 mom_priv]# getent passwd
mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
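
A minimal sketch for checking, from a shell on the node, whether the first 
access to an unmounted directory succeeds or only a later one does.  The 
function name probe_dir and its arguments are hypothetical, not part of 
torque or autofs, and an interactive shell is not guaranteed to behave the 
same way pbs_mom does when it starts a job.

```shell
#!/bin/sh
# probe_dir PATH [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical diagnostic helper; not part of torque or autofs.
# Tries to cd into PATH, retrying up to ATTEMPTS times.
# Prints the number of the first attempt that succeeded and returns 0,
# or prints "failed" and returns 1 if no attempt succeeded.
# Note: plain POSIX sh has no local variables, so these names are global.
probe_dir() {
    dir=$1
    attempts=${2:-3}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # The cd runs in a subshell, so the caller's working directory
        # is unchanged; the access itself is what should trigger autofs.
        if (cd "$dir") 2>/dev/null; then
            echo "$i"
            return 0
        fi
        # Pause only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "failed"
    return 1
}
```

Run as the job's user after the automount has timed out, for example 
probe_dir /fs/userB1/mfitzpat 3 1.  An output of 1 would mean the first 
access mounted the directory; 2 or higher would match the "fails once, then 
works on resubmit" behaviour seen with pbs_mom.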

Luke Scharf wrote:
> Nothing that you mention looks amiss at first glance...
>
>
> Does the "getent passwd" information for the user have a correct home 
> directory on the node?
>
> -Luke
>
>
> Mary Ellen Fitzpatrick wrote:
>> Thanks Luke.
>> Right now, my cluster is one node, with an additional 50 to be brought 
>> on-line once I resolve the automount problem.  The job I am running 
>> is very simple, with no nfs load on the server.
>>
>> My $usecp setting, I believe, is correct and works properly once the 
>> nfs dir is mounted.
>> $usecp *:/fs/userB1 /fs/userB1
>>
>> My auto.home file:
>> userB1  -rw,hard,intr   userB:/userB/u1
>>
>> auto.master file:
>> #+auto.master
>> /fs     /etc/auto.home
>>
>> I believe it is an automount issue and I need to tweak a parameter in 
>> a config file.  Not sure which one it is at this point.
>>
>>
>> Luke Scharf wrote:
>>> Mary Ellen Fitzpatrick wrote:
>>>> I have my home dirs nfs exported to all of my compute nodes.  I can 
>>>> log into the nodes and cd to the nfs-mounted dirs, no problem. When I 
>>>> submit a job to a node and the automounted nfs dirs are not mounted 
>>>> (timed out), I get the following error:
>>>>
>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2) in 
>>>> TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat' failed: No 
>>>> such file or directory
>>>>
>>>> If I immediately resubmit the job to the same node, it will run.  
>>>> It appears that pbs wants the automounted nfs dirs to be already 
>>>> mounted, then the job will run.  If I hard mount the nfs home dirs, 
>>>> I have no problem running the jobs, but I do not want to do that.
>>>>
>>>> Has anyone run into this?  I am trying to figure out whether it is a 
>>>> torque issue or an automount issue.
>>>
>>> How big is your cluster?  How capable is the NFS server?  A 
>>> job-start is likely to create a mountstorm, and generate a lot of 
>>> I/O.  Some servers can handle it, some can't.
>>>
>>> Yay for scaling issues!
>>>
>>> -Luke
>>>
>>> P.S. I second the suggestion of checking the $usecp value.
>>
>

-- 
Thanks
Mary Ellen


