[torqueusers] Question about the difference between a node where pbs_server is run and a compute node

Bas van der Vlies basv at sara.nl
Fri Apr 30 09:54:55 MDT 2010

> I can't say that I know what is going on over there.
>> When we run the same utilities on a 2.4.7 installation, /tmp/.torque-unix is used and no child is created.  The problem might be that the server only handles one connection when /tmp/.torque-unix is used. So when I do a pbs_connect() and let it linger, it will eventually time out, but the pbs_server does not accept connections anymore until the timeout.
>> That is why I asked if we can use the pbs_iff interface on the pbs_server again!
> ./configure --disable-unixsockets
> Note that, when I wrote it, the unix socket support was a huge performance
> boost and didn't suck up lots of privileged ports. But I can't comment on what
> happened to it in the 2.4.x branch.
Thanks for the pointer.  
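For reference, taking that configure flag would mean a rebuild along these lines (a sketch only; the install prefix and how you restart pbs_server depend on your installation):

```shell
# Rebuild Torque without unix-domain socket support (sketch)
./configure --disable-unixsockets
make
make install      # then restart pbs_server so the change takes effect
```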

>> It is easy to trigger: just use pbs_connect() and do not close it. We have tested it on:
>>  - debian lenny
>>  - centos 5
> wait... I thought you were having a problem with the basic stuff like qstat? Those always immediately exit.
> I may have been misunderstanding the problem all along.
That is how we noticed this behavior. We have a daemon process running that keeps its connection open for a long time, and then the whole batch system freezes.
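This is not Torque code, but the failure mode described above can be sketched in a few lines of Python: a unix-socket server that serves one connection to completion before accepting the next, so a client that connects and then idles (like the daemon's lingering pbs_connect()) starves every client behind it. The socket path and the tiny request/reply protocol here are made up for illustration.

```python
import os
import socket
import tempfile
import threading
import time

path = os.path.join(tempfile.mkdtemp(), "torque-unix")

def server():
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        # One connection at a time: block reading this client before
        # accepting the next one.
        data = conn.recv(1024)
        conn.sendall(b"ok:" + data)
        conn.close()

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)  # let the server bind

# Client 1: connect and linger without sending, like the idle daemon.
lingerer = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
lingerer.connect(path)

# Client 2: a quick request (think qstat) now hangs behind client 1.
qstat = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
qstat.connect(path)
qstat.sendall(b"stat")
qstat.settimeout(1.0)
try:
    qstat.recv(1024)
    outcome = "served"
except socket.timeout:
    outcome = "frozen"  # server is still blocked on the lingering client
print(outcome)
```

Running this prints "frozen": the second client's connect() succeeds (it sits in the listen backlog), but it never gets a reply while the first connection is open.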

>> ------------------------------- 
>> We found the problem on the pbs_server:
>>  - /var/spool/torque/server_name
>> If this file contains a name that is in /etc/hosts, it uses the /tmp/.torque-unix mechanism that causes the problem. If it contains a name that must be resolved other than via /etc/hosts, it will use the pbs_iff interface; this has no problem because a child process is created.
>> So the temporary solution is to use a name that must be resolved by DNS.  
> No, it has nothing to do with DNS. Torque has no idea how a name is found. The
> lower-level system libs do that.
We have done some more testing. Our pbs_server has hostname:
 * login1.irc.sara.nl

In the /var/spool/torque/server_name we have defined:
 * login1.irc.sara.nl

With this setup /tmp/.torque-unix is used.

When we change this name to:
 - login1
 - clearspeed2.irc.sara.nl, which is an alias

then pbs_iff is used.
17:44 login1.irc.sara.nl:/root 
root# strace qstat 2>&1 | grep pbs_iff
stat("/usr/sbin/pbs_iff", {st_mode=S_IFREG|S_ISUID|0755, st_size=17027, ...}) = 0

You are right, it has nothing to do with the DNS setup. The server name defined in /var/spool/torque/server_name is the trigger.

>> The question is: can the unix domain socket handle more than one connection?
> It certainly should. It is just a different transport layer. This is the first
> time I've heard a complaint.

We never triggered this before. We used the 2.1.X versions, and for our grid cluster and national compute cluster we switched to 2.4.7 and are now experiencing this
behavior. It is like a DoS attack on the pbs_server.
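It certainly can, transport-wise: the fix on the server side is simply to hand each accepted connection to its own worker. A minimal Python sketch of the same toy protocol as above, but with one thread per connection (again, an illustration, not the Torque implementation):

```python
import os
import socket
import tempfile
import threading
import time

path = os.path.join(tempfile.mkdtemp(), "torque-unix")

def handle(conn):
    # Each connection is served independently; an idle client only
    # ties up its own worker thread.
    data = conn.recv(1024)
    if data:
        conn.sendall(b"ok:" + data)
    conn.close()

def server():
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)  # let the server bind

# A lingering client that sends nothing...
lingerer = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
lingerer.connect(path)

# ...no longer blocks a quick request from a second client.
qstat = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
qstat.connect(path)
qstat.sendall(b"stat")
qstat.settimeout(1.0)
reply = qstat.recv(1024)
print(reply)
```

Here the second client gets its reply immediately, which matches the pbs_iff behavior described earlier in this thread, where a child process is created per connection.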

Again thanks for the explanation and pointers.

Bas van der Vlies
basv at sara.nl
