[torqueusers] pbs_mom unable to chdir to automounted dirs
Mary Ellen Fitzpatrick
mfitzpat at bu.edu
Wed Oct 22 12:37:07 MDT 2008
Yeah, that is why I am stumped: I can cd to the NFS dirs, so autofs
seems to be working correctly. But unless the NFS dir is already
mounted, pbs_mom cannot find it. Very strange...
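
One way to narrow this down on a node is to look at what is mounted before and after a chdir attempt. The sketch below is only an illustration, not anything from Torque itself: the function names are hypothetical, it assumes Linux-style /proc/mounts text (device, mount point, type, options), and it ignores the octal escaping of spaces in mount points.

```python
import os

def mounted_points(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points

def covering_mount(path, points):
    """Return the longest mount point that contains path, or None."""
    path = os.path.normpath(path)
    best = None
    for point in points:
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            if best is None or len(point) > len(best):
                best = point
    return best

def try_chdir(path):
    """Attempt chdir; return None on success or the errno on failure."""
    try:
        os.chdir(path)
        return None
    except OSError as exc:
        return exc.errno
```

Reading /proc/mounts, calling covering_mount("/fs/userB1/mfitzpat", ...), then calling try_chdir on the same path and reading /proc/mounts again shows whether the chdir triggered the automount. If the covering mount is still the autofs point (or /) after a failed chdir, the automounter did not respond to that lookup.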
Yes, getent passwd gives the correct home dir info:
[root@node1048 mom_priv]# getent passwd
mfitzpat:!!:xxxxxx:xxx:mfitzpat:/fs/userB1/mfitzpat:/bin/bash
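
The same check can be scripted so it can be repeated on each node. This is a minimal sketch, assuming Python is available on the node; the helper names are made up for illustration. pwd.getpwnam goes through the same name-service lookup as getent, and the isdir call stats the directory, which should normally prompt autofs to mount it.

```python
import os
import pwd

def home_from_passwd_line(line):
    """Return the home-directory field (6th of 7) from a passwd-format line."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields, got %d" % len(fields))
    return fields[5]

def check_home(user):
    """Look the user up via the passwd database and test the home dir.

    Returns (home_path, reachable); reachable is False if the
    directory cannot be stat'ed at the moment of the call.
    """
    home = pwd.getpwnam(user).pw_dir
    return home, os.path.isdir(home)
```

Running check_home("mfitzpat") right after the automount has timed out shows whether a plain stat from a root process is enough to bring the directory back.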
Luke Scharf wrote:
> Nothing that you mention looks amiss at first glance...
>
>
> Does the "getent passwd" information for the user have a correct home
> directory on the node?
>
> -Luke
>
>
> Mary Ellen Fitzpatrick wrote:
>> Thanks Luke.
>> Right now, my cluster is one node, with an additional 50 to be brought
>> online once I resolve the automount problem. The job I am running
>> is very simple and puts no NFS load on the server.
>>
>> My $usecp, I believe, is correct and works properly once the NFS dir
>> is mounted.
>> $usecp *:/fs/userB1 /fs/userB1
>>
>> My auto.home file:
>> userB1 -rw,hard,intr userB:/userB/u1
>>
>> auto.master file:
>> #+auto.master
>> /fs /etc/auto.home
>>
>> I believe it is an automount issue and I need to tweak a parameter in
>> a config file. Not sure which one it is at this point.
>>
>>
>> Luke Scharf wrote:
>>> Mary Ellen Fitzpatrick wrote:
>>>> I have my home dirs NFS-exported to all of my compute nodes. I can
>>>> log into the nodes and cd to the NFS-mounted dirs, no problem. When I
>>>> submit a job to a node and the automounted NFS dirs are not mounted
>>>> (timed out), I get the following error:
>>>>
>>>> Oct 21 16:08:14 node1047 pbs_mom: No such file or directory (2) in
>>>> TMomFinalizeChild, PBS: chdir to '/fs/userB1/mfitzpat' failed: No
>>>> such file or directory
>>>>
>>>> If I immediately resubmit the job to the same node, it will run.
>>>> It appears that pbs wants the automounted NFS dirs to be already
>>>> mounted before the job will run. If I hard mount the NFS home dirs,
>>>> I have no problem running the jobs, but I do not want to do that.
>>>>
>>>> Has anyone run into this? I am trying to figure out if it is a
>>>> Torque issue or an automount issue.
>>>
>>> How big is your cluster? How capable is the NFS server? A
>>> job start is likely to create a mount storm and generate a lot of
>>> I/O. Some servers can handle it, some can't.
>>>
>>> Yay for scaling issues!
>>>
>>> -Luke
>>>
>>> P.S. I second the suggestion of checking the $usecp value.
>>
>
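
To double-check that the maps quoted above really produce the path named in $usecp, the composition can be worked through by hand or with a small script. The sketch below is a simplified reading of the two map formats, not a full autofs parser: it handles only plain indirect-map entries of the form "key [-options] location", skips comments and "+" include lines, and takes the map contents as in-memory strings. All function names are hypothetical.

```python
def parse_master(text):
    """Map each mount-point directory in auto.master text to its map file."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and "+" include lines.
        if not line or line.startswith("#") or line.startswith("+"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table

def parse_indirect_map(text):
    """Map each key in an indirect map to (options, location)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 3 and fields[1].startswith("-"):
            entries[fields[0]] = (fields[1][1:], fields[2])
        elif len(fields) == 2:
            entries[fields[0]] = ("", fields[1])
    return entries

def automount_paths(master_text, maps):
    """Return {local path: NFS location} for every key under every mount point.

    maps is a dict of map-file name -> map text (in-memory stand-in
    for reading the files from /etc).
    """
    result = {}
    for mount_dir, map_file in parse_master(master_text).items():
        entries = parse_indirect_map(maps.get(map_file, ""))
        for key, (_opts, location) in entries.items():
            result[mount_dir.rstrip("/") + "/" + key] = location
    return result
```

With the auto.master and auto.home lines from this thread, the result is the single entry /fs/userB1 -> userB:/userB/u1, which matches the /fs/userB1 prefix used in $usecp and in the home directory.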
--
Thanks
Mary Ellen