[torqueusers] Problem with Torque with AMD Opteron and RHEL 3

Leandro Tavares Carneiro leandro at ep.petrobras.com.br
Mon Dec 13 06:24:14 MST 2004


Well, we have used /etc/hosts.equiv here for a long time, and we have never
needed to put the FQDN or IP address in it. I use only the short name of each
node and server.

But, just in case, I ran a test, without success... I think I now have to try
another approach. I will try *older* versions of Torque to see what happens.
And if that does not change the results, I will try *ancient* versions...

Well, wish me luck!

Best Regards,

Leandro Tavares Carneiro
Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
Tel: (0xx21) 2534-1427


Valery Mitsyn wrote:
> Did you try to create /etc/hosts.equiv and put the server and all
> interactive nodes (FQDN, official name and IP address) in it?
> 
> On Thu, 9 Dec 2004, Leandro Tavares Carneiro wrote:
> 
>>Bas,
>>
>>I have tried changing the permissions of a user's home directory to 777,
>>and the behavior is the same, but it is worse with the p5 snapshot.
>>
>>With p3, which is the version working on the other clusters we have here
>>with the same users, I can run a job on one machine. It works, but when I
>>use more than one, it doesn't work...
>>
>>I have done some tests using local user accounts and it works. I have also
>>exported a home area for this user from a Linux server *without* the
>>no_root_squash parameter. By the way, I used root_squash to enforce that,
>>and it works correctly.
>>
>>I think the problem is somewhere else, and that the chmod of the home area
>>or the export with no_root_squash is a coincidence.
>>
>>I hope someone can help me. I'm in trouble because of that cluster.
>>
>>Thanks for your help,
>>
>>Regards,
>>
>>Leandro Tavares Carneiro
>>Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
>>Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
>>Tel: (0xx21) 2534-1427
>>
>>
>>Bas van der Vlies wrote:
>>>Dave Jackson wrote:
>>>>Bas,
>>>>
>>>>  This should be easy to patch but we have so far been unable to
>>>>reproduce it in our lab with or without root squash.  If any site can
>>>>reliably reproduce it and is able to work with us, we can most likely
>>>>correct this today.
>>>>
>>>Dave,
>>>
>>> It is easy for me to reproduce. I just chmod 700 my home directory.
>>> Or should I try the new p5 snapshot on one node?
>>>
>>>
>>> We have a timezone difference ;-)
>>>
>>>
>>>>On Wed, 2004-12-08 at 03:58, Bas van der Vlies wrote:
>>>>
>>>>>We at SARA have the same problem. I have turned on root_squash. The
>>>>>problem disappeared if I made my home directory 755. But that is not a
>>>>>real solution. We are using torque 1.1.0p4
>>>>>
>>>>>        Regards
>>>>>
>>>>>Leandro Tavares Carneiro wrote:
>>>>>
>>>>>>Chris,
>>>>>>
>>>>>>I can see the home directory of all users, but I don't have it
>>>>>>exported with the no_root_squash parameter because we didn't need it
>>>>>>before, and this home area is served to the users by some NetApp
>>>>>>filers.
>>>>>>
>>>>>>We have other clusters here with many more nodes and we never
>>>>>>had this problem. The other clusters are Xeon and the OS is the old
>>>>>>RedHat. This problem only happens with this Opteron/RHEL WS cluster.
>>>>>>
>>>>>>Thanks for your help,
>>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>Leandro Tavares Carneiro
>>>>>>Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
>>>>>>Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
>>>>>>Tel: (0xx21) 2534-1427
>>>>>>
>>>>>>
>>>>>>Chris Samuel wrote:
>>>>>>
>>>>>>
>>>>>>>On Tue, 7 Dec 2004 10:18 pm, Leandro Tavares Carneiro wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>      I have checked everything on my nodes and server and everything
>>>>>>>>is OK. All the nodes can recognize the user id I'm using and the home
>>>>>>>>directory is mounted, but I still get this error.
>>>>>>>>
>>>>>>>>Dec  7 09:04:32 node002 pbs_mom: scan_for_exiting, cannot chdir to user home directory
>>>>>>>
>>>>>>>
>>>>>>>Are you exporting the users' home directories with no_root_squash
>>>>>>>from the NFS server?
>>>>>>>
>>>>>>>The easiest way to check that is to log in to node002 as root and then
>>>>>>>try to cd to the user's home directory. If you get a permission
>>>>>>>denied error, this is probably what's going on.
>>>>>>>
>>>>>>>A number of folks have reported this recently, it doesn't affect us
>>>>>>>here as we're exporting with no_root_squash (we have total control
>>>>>>>over all clients and server).
>>>>>>>
>>>>>>>The other time we've seen this is after an NFS server crash when
>>>>>>>the clients have stale NFS file handles, again trying the above
>>>>>>>should tell you.
>>>>>>>
>>>>>>>It would be very nice if the pbs_mom reported the value of errno
>>>>>>>and its sys_errlist equivalent. :-)
>>>>>>>
>>>>>>>cheers,
>>>>>>>Chris
>>>>>>>
>>>>>>>
>>>>>>>------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>torqueusers mailing list
>>>>>>>torqueusers at supercluster.org
>>>>>>>http://supercluster.org/mailman/listinfo/torqueusers
>>>
>>
> 
> Best regards,
>  Valery Mitsyn
> 
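
The check Chris describes in the quoted thread (log in to the node as root and try to cd to the user's home directory), together with his wish for the errno value to be reported, can be sketched roughly as below. This is only an illustrative diagnostic, not pbs_mom code: the function name check_chdir and the command-line usage are hypothetical, and it assumes Python is available on the node.

```python
import errno
import os
import sys


def check_chdir(path):
    """Try to chdir into path and report the outcome.

    Returns (ok, errno_name, message). On failure, errno_name is the
    symbolic errno (for example "EACCES") and message is the strerror text.
    The original working directory is restored on success.
    """
    saved = os.getcwd()
    try:
        os.chdir(path)
    except OSError as exc:
        # Fall back to the numeric value if the errno has no known name.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return False, name, os.strerror(exc.errno)
    os.chdir(saved)
    return True, None, None


if __name__ == "__main__":
    # Hypothetical usage: pass one or more home directories as arguments.
    for target in sys.argv[1:]:
        ok, name, message = check_chdir(target)
        if ok:
            print("%s: chdir OK" % target)
        else:
            print("%s: chdir failed: %s (%s)" % (target, name, message))
```

Run as root on the affected node against a user's home directory, an EACCES (permission denied) result would be consistent with the root squash explanation, while ESTALE would point to the stale NFS file handle case mentioned above. Any other errno suggests the cause lies elsewhere.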


