[torqueusers] Torque submission host giving an error

Prakash Velayutham prakash.velayutham at cchmc.org
Wed Dec 17 12:01:08 MST 2008


Hi,

I think I know where the issue is. It does not seem to be a Torque
problem after all (as I had thought). A cron job on the submission host
traverses an NFSv4 tree and sets the ACLs on it at 5 minutes past every
hour, and the Torque error messages appear at exactly that time. My
assumption is that the folder tree is really big and that the "find"
over it takes up a lot of file descriptors, so qsub cannot get one to
talk to the server. Is this possible?
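
A minimal sketch for checking this while the cron job is running,
assuming a Linux host with ss (iproute2) installed. On Linux errno 98 is
EADDRINUSE, which normally means a local port could not be bound, so
besides file descriptors it may be worth counting the sockets that sit
on reserved local ports (below 1024). That pbs_iff needs one of those
ports is an assumption on my part, and count_reserved_ports is just a
hypothetical helper name.

```shell
#!/bin/sh
# Count sockets whose local port is in the reserved range (below 1024).
# Reads "ss -tan" style output on stdin. Assumptions: one header line,
# and the local address:port in the 4th column.
count_reserved_ports() {
    awk 'NR > 1 {
             n = split($4, a, ":")      # the port is the last ":" field
             if (a[n] + 0 < 1024) c++
         }
         END { print c + 0 }'
}
```

It could be run as "ss -tan | count_reserved_ports" shortly before and
during the cron run. For the descriptor side, "cat /proc/sys/fs/file-nr"
prints the system-wide file handle counts (allocated, unused, maximum).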

I have since moved the cron job to a different host and have not seen
the error yet. I will report back in a day or two with an update.
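
To pin down the timing, the pbs_iff test Yang suggested (quoted below)
could be wrapped in a small logger and run once a minute, so that any
failures can be lined up against the cron schedule. This is only a
sketch: log_probe is a made-up helper, and ServerName is the same
placeholder as in Yang's command.

```shell
#!/bin/sh
# Run a probe command quietly and print one timestamped line with its
# exit status. The probe's own exit status is passed back to the caller.
log_probe() {
    "$@" >/dev/null 2>&1
    status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') exit=$status cmd=$*"
    return "$status"
}

# Example loop (not run here), appending one line per minute:
# while true; do log_probe pbs_iff -t ServerName 15001; sleep 60; done >> probe.log
```

Non-zero exit values clustered just after 5 past the hour would support
the cron job theory.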

Thanks a lot,
Prakash


On Dec 15, 2008, at 5:15 PM, Yang Wang wrote:

> Then try this command to see whether pbs_iff can talk to pbs_server:
>
> pbs_iff -t ServerName 15001
>
> echo $?
>
> It seems some service is using this port. You may want to find out
> which one and kill it (I do not know how to find it).
>
> Good luck.
>
> Yang
>
> -----Original Message-----
> From: Prakash Velayutham [mailto:prakash.velayutham at cchmc.org]
> Sent: Monday, December 15, 2008 4:55 PM
> To: yang.wang at agencourt.com
> Cc: torqueusers Users
> Subject: Re: [torqueusers] Torque submission host giving an error
>
> Hi,
>
> Thanks for the suggestion. It has the correct permissions.
>
> -rwsr-xr-x 1 root root 20880 Nov 12 11:21 /usr/local/torque-2.1.10/sbin/pbs_iff
>
> Any other ideas?
>
> Thanks,
> Prakash
>
>
> On Dec 15, 2008, at 4:48 PM, Yang Wang wrote:
>
>> It seems the issue is related to the setuid bit on pbs_iff.
>> Could you make sure it is set correctly? Newer versions set this
>> during installation.
>>
>> "chmod 4755 /opt/torque/sbin/pbs_iff"
>>
>> Hope this helps.
>>
>> Yang
>>
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Prakash Velayutham
>> Sent: Monday, December 15, 2008 4:22 PM
>> To: torqueusers Users
>> Subject: [torqueusers] Torque submission host giving an error
>>
>> Hello,
>>
>> Torque - 2.1.10
>>
>> Recently I have been getting errors like this for some of the jobs
>> being submitted from one server to the Torque server.
>>
>> pbs_iff: cannot connect to bmiclustersvc4.cchmc.org:15001 - fatal error, errno=98 (Address already in use)
>> No Permission.
>> qsub: cannot connect to server bmiclustersvc4.cchmc.org (errno=15007)
>>
>> Would anyone here know what could be causing this? Obviously, the
>> Torque server does not have any logs about this, as the submission
>> host never succeeded in connecting to it. And I don't see any error
>> in the submission host's logs either.
>>
>> Thanks,
>> Prakash
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>
>


