[torqueusers] Question about the difference between a node where pbs_server is run and a compute node

Ken Nielson knielson at adaptivecomputing.com
Fri Apr 30 08:57:29 MDT 2010


On 04/29/2010 05:07 PM, Garrick Staples wrote:
> On Thu, Apr 29, 2010 at 09:26:06AM +0200, Bas van der Vlies alleged:
>    
>> On 28 apr 2010, at 20:37, Garrick Staples wrote:
>>
>>      
>>> On Wed, Apr 28, 2010 at 08:05:08PM +0200, Bas van der Vlies alleged:
>>>        
>>>> Just a question: is there a switch in configure to go back to the old pbs_iff behaviour?
>>>>          
>>> What old pbs_iff behaviour? The unix domain socket code has been there since the 2.1.x days.
>>>
>>>        
>> Garrick, can you explain why our 2.1.11 pbs utilities use the 'pbs_iff' interface to communicate with the pbs_server when they run on the node where pbs_server is started?  We do not have any problems there, because a child is created and pbs_server can accept connections again. So in this installation,
>> is /tmp/.torque-unix not used at all, or does it have a different name?
>>
>>      
> I can't say that I know what is going on over there.
>
>
>    
>> When we run the same utilities on a 2.4.7 installation, /tmp/.torque-unix is used and no child is created.  The problem might be that the server only handles one connection when /tmp/.torque-unix is used: when I do a pbs_connect() and let it linger, it will eventually time out, but pbs_server does not accept connections anymore until the timeout.
>>
>> That is why I asked if we can use the pbs_iff interface on the pbs_server again!
>>      
> ./configure --disable-unixsockets
>
> Note that, when I wrote it, the unix socket support was a huge performance
> boost and didn't suck up lots of privileged ports. But I can't comment on what
> happened to it in the 2.4.x branch.
>
>
>    
>> It is easy to trigger: just use pbs_connect() and do not close it (see the sketch below). We have tested it on:
>>    - debian lenny
>>    - centos 5
>>      
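A minimal sketch of that reproducer, assuming the TORQUE client headers and library are installed; the 300-second sleep, passing NULL to pbs_connect() to pick up the default server, and the build command below are illustrative choices, not details from the thread:

#include <stdio.h>
#include <unistd.h>
#include <pbs_ifl.h>   /* pbs_connect(), pbs_disconnect() */

int main(void)
  {
  /* NULL asks the client library for the default server (server_name) */
  int conn = pbs_connect(NULL);

  if (conn < 0)
    {
    fprintf(stderr, "pbs_connect failed: %d\n", conn);
    return 1;
    }

  printf("connected (%d); lingering without pbs_disconnect()...\n", conn);

  /* While this sleeps, run qstat against the same server from another
     shell; on the affected 2.4.7 setup it reportedly blocks until this
     connection times out. */
  sleep(300);

  return 0;
  }

Build with something like 'cc linger.c -I/usr/include/torque -ltorque'; the include path and library name vary by installation.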
> Wait... I thought you were having a problem with basic stuff like qstat? Those always exit immediately.
>
> I may have been misunderstanding the problem all along.
>
>
>    
>> -------------------------------
>> I found the problem on the pbs_server:
>>    - /var/spool/torque/server_name
>>
>> If this file contains a name that is in /etc/hosts, it uses the /tmp/.torque-unix mechanism that causes the problem. If it is set to a name that must be 'resolved' by something other than /etc/hosts, it will use the pbs_iff interface, and this has no problem because a child process is created.
>>
>> So the temporary solution is to use a name that must be resolved by DNS.
>>      
> No, it has nothing to do with DNS. Torque has no idea how a name is found. The
> lower-level system libs do that.
>
> If you look at the client lib code, there is a comparison after the name lookup
> against localhost and the server name.
>
> src/lib/Libifl/pbsD_connect.c:
> #ifdef ENABLE_UNIX_SOCKETS
>    /* determine if we want to use unix domain socket */
>
>    if (!strcmp(server, "localhost"))
>      use_unixsock = 1;
>    else if ((gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0) && !strcmp(hnamebuf, server))
>      use_unixsock = 1;
>
>
>
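A standalone sketch of that comparison (would_use_unixsock() is a hypothetical name for illustration, not TORQUE code) shows why the behaviour Bas saw falls out of the string compare rather than DNS: only "localhost" or an exact match against gethostname() output takes the unix-socket path, so a fully-qualified name in server_name drops back to pbs_iff:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int would_use_unixsock(const char *server)
  {
  char hnamebuf[256] = "";

  if (!strcmp(server, "localhost"))
    return 1;

  if ((gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0) &&
      !strcmp(hnamebuf, server))
    return 1;

  return 0;
  }

int main(int argc, char *argv[])
  {
  const char *server = (argc > 1) ? argv[1] : "localhost";

  printf("%s -> %s\n", server,
         would_use_unixsock(server) ? "unix socket" : "pbs_iff/TCP");
  return 0;
  }

Running it once with the short hostname and once with the FQDN shows why the 'temporary solution' above works.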
>    
>> The question is: can the unix domain socket handle more than one connection?
>>      
> It certainly should. It is just a different transport layer. This is the first
> time I've heard a complaint.
>
>
>    
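For context on that point, here is a generic AF_UNIX sketch (not TORQUE code; the socket path is made up) showing that a unix domain socket happily serves many clients, as long as the server keeps returning to accept() — here by forking per connection:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
  {
  struct sockaddr_un addr;
  int lfd = socket(AF_UNIX, SOCK_STREAM, 0);

  if (lfd < 0)
    { perror("socket"); exit(1); }

  signal(SIGCHLD, SIG_IGN);  /* auto-reap children */

  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, "/tmp/.demo-unix", sizeof(addr.sun_path) - 1);
  unlink(addr.sun_path);

  if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
      listen(lfd, 64) != 0)
    { perror("bind/listen"); exit(1); }

  for (;;)
    {
    int cfd = accept(lfd, NULL, NULL);

    if (cfd < 0)
      continue;

    if (fork() == 0)
      {
      /* child: serve this client; even if it lingers here, the
         parent is already back in accept() for the next client */
      close(lfd);
      /* ... handle the request ... */
      close(cfd);
      _exit(0);
      }

    close(cfd);  /* parent goes back to accept() */
    }
  }

Whether pbs_server gets back to accept() promptly on the /tmp/.torque-unix path in the 2.4.x branch is exactly what Bas's report calls into question.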
We have not changed anything in the unix sockets code, nor anything in
pbs_iff. We will need to get into your system to see where the hang-up is.

Ken Nielson
Adaptive Computing

