[torqueusers] Question about the difference between a node where pbs_server is run and a compute node
Garrick Staples
garrick at usc.edu
Thu Apr 29 17:07:27 MDT 2010
On Thu, Apr 29, 2010 at 09:26:06AM +0200, Bas van der Vlies alleged:
>
> On 28 apr 2010, at 20:37, Garrick Staples wrote:
>
> > On Wed, Apr 28, 2010 at 08:05:08PM +0200, Bas van der Vlies alleged:
> >> Just a question is there switch in the configure to switch back to the old pbs_iff behaviour?
> >
> > What old pbs_iff behaviour? The unix domain socket code has been there since the 2.1.x days.
> >
>
> Garrick, can you explain why our 2.1.11 pbs utilities use the 'pbs_iff' interface to communicate with the pbs_server when they run on the node where the pbs_server is started? We do not have any problems because a child is created and pbs_server can accept connections again. So in this installation
> /tmp/.torque-unix is not used at all, or does it have a different name?
>
I can't say that I know what is going on over there.
> When we run the same utilities on a 2.4.7 installation, /tmp/.torque-unix is used and no child is created. The problem might be that the server only handles one connection when /tmp/.torque-unix is used. So when I do a pbs_connect() and let it linger, it will eventually time out, but the pbs_server does not accept any other connections until the timeout.
>
> That is why I asked if we can use the pbs_iff interface on the pbs_server again!
./configure --disable-unixsockets
Note that, when I wrote it, the unix socket support was a huge performance
boost and didn't suck up lots of privileged ports. But I can't comment on what
happened to it in the 2.4.x branch.
> It is easy to trigger. Just use pbs_connect() and do not close the connection. We have tested it on:
> - debian lenny
> - centos 5
Wait... I thought you were having a problem with the basic stuff like qstat? Those always exit immediately.
I may have been misunderstanding the problem all along.
> -------------------------------
> I found the problem on the pbs_server:
> - /var/spool/torque/server_name
>
> If this contains a name that is in /etc/hosts, it uses the /tmp/.torque-unix mechanism that causes the problem. If it contains a name that must be resolved by some means other than /etc/hosts, it will use the pbs_iff interface, which has no problem because a child process is created.
>
> So the temporary solution is to use a name that must be resolved by DNS.
No, it has nothing to do with DNS. Torque has no idea how a name is found. The
lower-level system libs do that.
If you look at the client lib code, there is a comparison after the name lookup
against localhost and the server name.
src/lib/Libifl/pbsD_connect.c:

    #ifdef ENABLE_UNIX_SOCKETS
      /* determine if we want to use unix domain socket */
      if (!strcmp(server, "localhost"))
        use_unixsock = 1;
      else if ((gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0) && !strcmp(hnamebuf, server))
        use_unixsock = 1;
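
For anyone who wants to check which path a given server name would take on their own machine, here is a minimal standalone sketch of the same comparison. It is not the Torque code: the function name would_use_unixsock() and the buffer size are made up for illustration, and it only mirrors the two string comparisons in the excerpt above. Because the test is an exact strcmp() against whatever gethostname() returns, a short name and a fully qualified name for the same host would be treated differently.

```c
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Torque: returns 1 when the given
 * server name would select the unix domain socket under the two
 * comparisons shown in the excerpt, and 0 otherwise.
 */
int would_use_unixsock(const char *server)
{
    char hnamebuf[256];   /* assumed size, chosen for this sketch */

    if (server == NULL)
        return 0;

    /* case 1: the literal name "localhost" */
    if (strcmp(server, "localhost") == 0)
        return 1;

    /* case 2: exact match with the local hostname as the kernel reports it */
    memset(hnamebuf, 0, sizeof(hnamebuf));
    if (gethostname(hnamebuf, sizeof(hnamebuf) - 1) == 0 &&
        strcmp(hnamebuf, server) == 0)
        return 1;

    return 0;
}
```

Calling it with the contents of the server_name file shows whether that name matches either case. Any other name, however it is resolved, falls through to the non-unix-socket path.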
> The question is: can the unix domain socket handle more than one connection?
It certainly should. It is just a different transport layer. This is the first
time I've heard a complaint.
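
As a rough illustration of the transport-layer point, the sketch below shows that a plain AF_UNIX stream listener accepts a second client while the first is still open. It says nothing about how pbs_server services its connections. The helper names make_unix_listener() and connect_unix() are hypothetical, and any socket path passed to them is the caller's choice, not a Torque path.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill a sockaddr_un for the given path; returns -1 if the path is too long. */
static int fill_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* Hypothetical helper: create a listening unix domain socket at path. */
int make_unix_listener(const char *path, int backlog)
{
    struct sockaddr_un addr;
    int fd;

    if (fill_addr(&addr, path) != 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    unlink(path);   /* remove a stale socket file from an earlier run */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, backlog) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: connect to the listener and leave the socket open. */
int connect_unix(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (fill_addr(&addr, path) != 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With a backlog larger than one, two connect_unix() calls in a row both succeed and two accept() calls on the listener each return a descriptor, even though the first client never closes. Whether a server keeps answering in that situation depends on how its accept and read loop is written, not on the socket type.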
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
Life is Good!