[torqueusers] Torque submission host giving an error
prakash.velayutham at cchmc.org
Wed Dec 17 12:01:08 MST 2008
I think I know where the issue is. It has nothing to do with Torque
(as I first thought) and is related to a cron job on the submission
host that traverses an NFSv4 tree and sets the ACLs on it at 5 minutes
past every hour. The Torque error messages appear at exactly that
time. My assumption is that the folder tree is really big, so the
"find" probably takes up a lot of file descriptors and qsub cannot get
one to talk to the server. Is this possible?
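One way to test this assumption would be to sample file descriptor
usage on the submission host while the cron job runs. A minimal sketch,
assuming a Linux host with /proc mounted; the function names fd_usage
and fd_count are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers, assuming Linux with /proc available.

# Print system-wide file handle usage.
# /proc/sys/fs/file-nr holds three fields: allocated, unused, maximum.
fd_usage() {
    read allocated unused maximum < /proc/sys/fs/file-nr
    echo "allocated=$allocated max=$maximum"
}

# Count the descriptors held by one process (defaults to this shell).
fd_count() {
    ls "/proc/${1:-$$}/fd" 2>/dev/null | wc -l
}

fd_usage
fd_count
```

If the allocated count stays far below the maximum during the run, file
descriptor exhaustion is probably not the cause.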
I have since moved the cron job to a different host and have not seen
the error since. I will report back in a day or two with an update.
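To confirm that the failures line up with the cron schedule, the
connection test suggested earlier in the thread (quoted below) could be
run periodically and logged with a timestamp. A rough sketch; the
probe function is a made-up name, and it simply wraps whatever command
it is given:

```shell
#!/bin/sh
# Hypothetical wrapper: run a command quietly and log a timestamp
# together with its exit status.
probe() {
    "$@" >/dev/null 2>&1
    status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') exit=$status"
}

# Example use with the test from this thread (ServerName is a placeholder):
# probe pbs_iff -t ServerName 15001
probe true
```

Non-zero exit values clustered around 5 minutes past the hour would
point back at the cron job.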
Thanks a lot,
On Dec 15, 2008, at 5:15 PM, Yang Wang wrote:
> Then try this command to see if pbs_iff can talk to pbs_server:
> pbs_iff -t ServerName 15001
> echo $?
> It seems some other service is using this port. You may want to find
> out which one and kill it (I do not know how to find it).
> Good luck.
> -----Original Message-----
> From: Prakash Velayutham [mailto:prakash.velayutham at cchmc.org]
> Sent: Monday, December 15, 2008 4:55 PM
> To: yang.wang at agencourt.com
> Cc: torqueusers Users
> Subject: Re: [torqueusers] Torque submission host giving an error
> Thanks for the suggestion. It has the correct permissions:
> -rwsr-xr-x 1 root root 20880 Nov 12 11:21 /usr/local/torque-2.1.10/
> Any other ideas?
> On Dec 15, 2008, at 4:48 PM, Yang Wang wrote:
>> It seems the issue is related to the pbs_iff setuid-bit setting.
>> Could you make sure pbs_iff is set up correctly? Newer versions set
>> this during installation.
>> "chmod 4755 /opt/torque/sbin/pbs_iff"
>> Hope this helps.
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Prakash Velayutham
>> Sent: Monday, December 15, 2008 4:22 PM
>> To: torqueusers Users
>> Subject: [torqueusers] Torque submission host giving an error
>> Torque - 2.1.10
>> Recently I have been getting errors like this for some of the jobs
>> being submitted from one server to the Torque server.
>> pbs_iff: cannot connect to bmiclustersvc4.cchmc.org:15001 - fatal
>> error, errno=98 (Address already in use)
>> No Permission.
>> qsub: cannot connect to server bmiclustersvc4.cchmc.org (errno=15007)
>> Would anyone here know what could be causing this? Obviously, the
>> Torque server has no logs about this, as the submission host never
>> succeeded in connecting to it. And I don't see any errors in the
>> submission host's logs either.