[torqueusers] Question about the difference between a node where pbs_server is run and a compute node

Garrick Staples garrick at usc.edu
Thu Apr 29 17:07:27 MDT 2010


On Thu, Apr 29, 2010 at 09:26:06AM +0200, Bas van der Vlies alleged:
> 
> On 28 apr 2010, at 20:37, Garrick Staples wrote:
> 
> > On Wed, Apr 28, 2010 at 08:05:08PM +0200, Bas van der Vlies alleged:
> >> Just a question: is there a switch in configure to go back to the old pbs_iff behaviour?
> > 
> > What old pbs_iff behaviour? The unix domain socket code has been there since the 2.1.x days.
> > 
> 
> Garrick, can you explain why our 2.1.11 pbs utilities use the 'pbs_iff' interface to communicate with the pbs_server if they run on the node where the pbs_server is started? We do not have any problems because a child is created and pbs_server can accept connections again. So in this installation,
> is /tmp/.torque-unix not used at all, or does it have a different name?
> 

I can't say that I know what is going on over there.


> When we run the same utilities on a 2.4.7 installation, /tmp/.torque-unix is used and no child is created. The problem might be that the server only handles one connection when /tmp/.torque-unix is used. So when I do a pbs_connect() and let it linger, it will eventually time out, but the pbs_server does not accept any other connections until that timeout.
> 
> That is why i asked if we can use the pbs_iff interface on the pbs_server again!!!  

./configure --disable-unixsockets

Note that, when I wrote it, the unix socket support was a huge performance
boost and didn't suck up lots of privileged ports. But I can't comment on what
happened to it in the 2.4.x branch.


> It is easy to trigger. Just use pbs_connect() and do not close it. We have tested it on:
>   - debian lenny
>   - centos 5

Wait... I thought you were having a problem with the basic stuff like qstat? Those always exit immediately.
 
I may have been misunderstanding the problem all along.


> ------------------------------- 
> I found the problem on the pbs_server:
>   - /var/spool/torque/server_name
> 
> If this contains a name that is in /etc/hosts, it uses the /tmp/.torque-unix mechanism that causes the problem. If it contains a name that must be resolved by something other than /etc/hosts, it will use the pbs_iff interface, which has no problem because a child process is created.
> 
> So the temporary solution is to use a name that must be resolved by DNS.  

No, it has nothing to do with DNS. Torque has no idea how a name is found. The
lower-level system libs do that.

If you look at the client lib code, there is a comparison after the name lookup
against localhost and the server name.

src/lib/Libifl/pbsD_connect.c:
#ifdef ENABLE_UNIX_SOCKETS
  /* determine if we want to use unix domain socket */

  if (!strcmp(server, "localhost"))
    use_unixsock = 1;
  else if ((gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0) && !strcmp(hnamebuf, server))
    use_unixsock = 1;
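
To make the point concrete, here is a minimal standalone sketch of that same
decision, assuming nothing beyond the excerpt above. The function name
would_use_unixsock() is hypothetical and is not part of Torque. The check is
a plain string comparison, so a server name that differs textually from what
gethostname() returns (for example a fully qualified name on a host that
reports its short name) would not match, regardless of how the name is
resolved.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, NOT a Torque function: returns 1 if a client
 * following the logic in the excerpt would pick the unix domain socket
 * for this server name, and 0 if it would fall back to TCP. */
int would_use_unixsock(const char *server)
{
    char hnamebuf[256];

    assert(server != NULL);

    /* Case 1: the server name is literally "localhost". */
    if (strcmp(server, "localhost") == 0)
        return 1;

    /* Case 2: the server name is byte-for-byte what gethostname() reports.
     * Pre-terminate the buffer in case the name gets truncated. */
    hnamebuf[sizeof(hnamebuf) - 1] = '\0';
    if (gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0 &&
        strcmp(hnamebuf, server) == 0)
        return 1;

    return 0;
}
```

Calling would_use_unixsock() with the contents of your server_name file and
comparing against the output of the hostname command should show which path
a client on that machine would take.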

 

> The question is: can the unix domain socket handle more than one connection?

It certainly should. It is just a different transport layer. This is the first
time I've heard a complaint.
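
As a rough illustration of the transport itself, the sketch below opens a
listening unix domain socket and accepts several connections while all
earlier ones stay open. The function name and the socket path passed to it
are made up for this example. It says nothing about how pbs_server services
a connection once it has accepted it. It only shows that the socket layer
does not limit a listener to one client.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MAX_CLIENTS 16

/* Hypothetical test helper: connects nclients clients to a unix domain
 * socket at 'path', accepting each one while all earlier connections are
 * still open. Returns the number accepted, or -1 on setup failure. */
int accept_while_clients_linger(const char *path, int nclients)
{
    struct sockaddr_un addr;
    int client_fds[MAX_CLIENTS];
    int server_fds[MAX_CLIENTS];
    int listen_fd;
    int accepted = 0;
    int i;

    assert(path != NULL && strlen(path) < sizeof(addr.sun_path));
    assert(nclients > 0 && nclients <= MAX_CLIENTS);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);   /* length checked by the assert above */
    unlink(path);                  /* remove a stale socket file, if any */

    listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listen_fd < 0)
        return -1;
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listen_fd, nclients) < 0) {
        close(listen_fd);
        return -1;
    }

    for (i = 0; i < nclients; i++) {
        int cfd = socket(AF_UNIX, SOCK_STREAM, 0);
        int sfd;

        if (cfd < 0)
            break;
        if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(cfd);
            break;
        }
        /* Earlier clients are deliberately left open (lingering). */
        sfd = accept(listen_fd, NULL, NULL);
        if (sfd < 0) {
            close(cfd);
            break;
        }
        client_fds[accepted] = cfd;
        server_fds[accepted] = sfd;
        accepted++;
    }

    /* Clean up everything only after all accepts have been attempted. */
    for (i = 0; i < accepted; i++) {
        close(client_fds[i]);
        close(server_fds[i]);
    }
    close(listen_fd);
    unlink(path);
    return accepted;
}
```

If a call such as accept_while_clients_linger("/tmp/.example-unix", 3)
returns 3, the transport is fine on that system, and any blocking would have
to come from how the server handles the accepted connection.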


-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!