[torqueusers] Re: FW: HPUX 11 failure torque 2.0.0p1, 2, 3 and 1.2.0p6

Garrick Staples garrick at usc.edu
Thu Dec 15 16:16:13 MST 2005


Did I get everything right in this p4 snap?
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.0.0p4-snap.1134687812.tar.gz

On Thu, Dec 15, 2005 at 02:18:11PM -0800, Garrick Staples alleged:
> From what I gather, these problems stem from differences in the type of
> the length arg to accept(), getsockopt(), and setsockopt() (passed by
> pointer to the first two, by value to the last).  BSD has it as "int",
> earlier POSIX changed it to "size_t" (which is unsigned, and not
> necessarily the same size as int), and later POSIX changed it to
> "socklen_t" (typically unsigned, same size as int).
> 
> It was changed from int because of mixed signedness issues, but I guess I'll
> just change them all back to int.  That seems to be the most portable
> thing.
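
For illustration, here is a rough sketch of keeping the length type in one
place so it can be switched per platform. TORQUE_SOCKLEN_T,
torque_socklen_t, set_reuseaddr() and get_reuseaddr() are made-up names for
this example, not anything in the TORQUE tree.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical build-time switch: define TORQUE_SOCKLEN_T as int on
 * platforms whose prototypes still take "int *"; default to socklen_t. */
#ifndef TORQUE_SOCKLEN_T
#define TORQUE_SOCKLEN_T socklen_t
#endif

typedef TORQUE_SOCKLEN_T torque_socklen_t;

/* Set SO_REUSEADDR using an int option value and its own size. */
int set_reuseaddr(int sock)
{
  int one = 1;

  return setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (void *)&one, sizeof(one));
}

/* Read SO_REUSEADDR back: 1 if set, 0 if clear, -1 on error. */
int get_reuseaddr(int sock)
{
  int value = 0;
  torque_socklen_t len = sizeof(value);

  if (getsockopt(sock, SOL_SOCKET, SO_REUSEADDR, (void *)&value, &len) < 0)
    return -1;

  return value != 0;
}
```

The same typedef would be used for the length passed to accept(), so only
the one #define needs to change on a platform that disagrees.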
> 
> On Thu, Dec 15, 2005 at 04:01:22PM -0600, Mike Coyne alleged:
> > Here is a diff of net_server.c in src/lib/Libnet.  I made the "second"
> > change in the current p3 build of net_server.c, adding back the #if
> > defined _SOCKLEN_T stuff.  It appeared to correct the problem with the
> > PBSE_BADCRED.  This is a little premature; I need to do an install on p3
> > and run it a bit, though.  This also included the previous fix to
> > net_client.c.
> > 
> > HP-UX seems to have heartburn with socklen_t ...
> > 
> > *** net_server.c        Wed Nov  9 00:38:22 2005
> > --- /home/mcoyne/torque/torque-2.0.0p2/src/lib/Libnet/net_server.c        Fri Nov 11 03:27:08 2005
> > ***************
> > *** 259,270 ****
> >     struct timeval timeout;
> >     void close_conn();
> >   
> > -   timeout.tv_usec = 0;
> > -   timeout.tv_sec  = waittime;
> > - 
> >     char tmpLine[1024];
> >     char id[]="wait_request";
> >   
> >     selset = readset;  /* readset is global */
> >   
> >     n = select(FD_SETSIZE,&selset,(fd_set *)0,(fd_set *)0,&timeout);
> > --- 259,270 ----
> >     struct timeval timeout;
> >     void close_conn();
> >   
> >     char tmpLine[1024];
> >     char id[]="wait_request";
> >   
> > +   timeout.tv_usec = 0;
> > +   timeout.tv_sec  = waittime;
> > + 
> >     selset = readset;  /* readset is global */
> >   
> >     n = select(FD_SETSIZE,&selset,(fd_set *)0,(fd_set *)0,&timeout);
> > ***************
> > *** 401,411 ****
> >     int newsock;
> >     struct sockaddr_in from;
> >   
> > - #if defined _SOCKLEN_T
> >     socklen_t fromsize;
> > - #else /* _SOCKLEN_T */
> > -   int fromsize;
> > - #endif /* _SCOKLEN_T */
> >   
> >     /* update lasttime of main socket */
> >   
> > --- 401,407 ----
> > 
> > -----Original Message-----
> > From: Garrick Staples [mailto:garrick at usc.edu] 
> > Sent: Thursday, December 15, 2005 2:24 PM
> > To: Mike Coyne
> > Cc: Lippert, Kenneth B.; torqueusers at supercluster.org
> > Subject: Re: FW: HPUX 11 failure torque 2.0.0p1,2,3 and 1.2.0p6
> > 
> > On Thu, Dec 15, 2005 at 01:29:29PM -0600, Mike Coyne alleged:
> > > There are some issues regarding HP-UX and TORQUE in versions after
> > > 1.2.0p5 surrounding pbs_iff on the client and server side.  On the
> > > client side, the file involved is src/lib/Libnet/net_client.c.
> > > 
> > > Below is a diff between 2.0.0p3 and 1.2.0p5.  In order to get pbs_iff
> > > to connect from a remote host (one of the mom clients) I had to
> > > backport the older version of this file.
> > 
> > The bits with tv_sec and select() don't look important to me.
> > 
> > The important part might be the size of 'one'.  I'm thinking it should
> > be an int, not a long.  Can you try just that one change in p3?
> > 
> > @@ -177,7 +172,7 @@ int client_to_svr(
> >    int                sock;
> >    unsigned short     tryport;
> >    int                flags;
> > -  int                one = 1;
> > +  long               one = 1;
> >    
> >    local.sin_family = AF_INET;
> >    local.sin_addr.s_addr = 0;
> > 
> > 
> > The argument changes to setsockopt() appear correct to me, especially
> > the last argument.
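
To show why the width of 'one' could matter, here is a small sketch. It
assumes the callee reads only the first sizeof(int) bytes of the option
buffer, which may or may not be what a given kernel does.
leading_int_of_long() is a made-up helper for this example.

```c
#include <assert.h>
#include <string.h>

/* Returns the int a callee would see if it read only the first
 * sizeof(int) bytes of a long.  On a big-endian machine where long is
 * wider than int, a long holding 1 starts with zero bytes. */
int leading_int_of_long(long value)
{
  int seen = 0;

  memcpy(&seen, &value, sizeof(seen));

  return seen;
}
```

Where long and int are the same size, or on little-endian hardware, this
returns 1 for an input of 1 and the problem stays hidden. On big-endian
hardware with a 64-bit long it returns 0, so the option would look disabled.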
> > 
> >  
> > > In order to get src/resmom/hpux11 (or hpux10)/mom_mach.c to compile I
> > > added
> > > 
> > >  extern  int     ignwalltime;
> > 
> > Ouch.  Fixed in CVS.
> > 
> > 
> > > The remaining problem is as follows:
> > > 
> > > pbs_iff disconnects with an invalid credential ==> PBSE_BADCRED in
> > > src/server/process_request.c from 
> > 
> > This would imply the bind() to a privileged port isn't working.
> > 
> > Do you have bindresvport() on HPUX?
> > 
> > 
> >  
> > > The output from gdb's svr_conn has a suspicious cn_addr.  The connection
> > > was from a qstat on the same host as the server, although this may be
> > > fallout from a previous authentication error?
> > 
> > > (gdb) print svr_conn[sfds]
> > > 
> > > $1 = {cn_addr = 2147483649, cn_handle = -1, cn_port = 40696, cn_authen = 0, 
> > 
> > cn_port should probably be less than 1024 at that point.
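
For anyone checking those gdb values by hand, here is a small sketch of the
two checks. It assumes cn_addr and cn_port are stored in host byte order,
which I have not verified; port_is_privileged() and format_ipv4() are
made-up helpers, not TORQUE functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Ports below 1024 are reserved for privileged processes on Unix. */
#define RESERVED_PORT_LIMIT 1024

/* Returns 1 if the peer's port (host byte order) is privileged. */
int port_is_privileged(unsigned int port)
{
  return port > 0 && port < RESERVED_PORT_LIMIT;
}

/* Formats a 32-bit IPv4 address, assumed to be in host byte order,
 * as a dotted quad in buf. */
void format_ipv4(unsigned long addr, char *buf, size_t buflen)
{
  snprintf(buf, buflen, "%lu.%lu.%lu.%lu",
           (addr >> 24) & 0xffUL, (addr >> 16) & 0xffUL,
           (addr >> 8) & 0xffUL, addr & 0xffUL);
}
```

With the numbers above, 40696 is not a privileged port, and 2147483649
formats as 128.0.0.1, which is not a loopback address.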
> > 
> > -- 
> > Garrick Staples, Linux/HPCC Administrator
> > University of Southern California
> > 





-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California