[torqueusers] torque-2.5.3-1cri.x86_64 hang when a node falls

David Beer dbeer at adaptivecomputing.com
Tue Mar 29 15:20:56 MDT 2011



----- Original Message -----
> Hi David,
> 
> > I think you have it configured correctly. If you are doubting, you
> > could double check to make sure that src/include/pbs_config.h has
> > TCP_RETRY_LIMIT defined as you expect.
> 
> it's not there...
> 

If it's not there, then you definitely know that it isn't set up correctly. Try running configure again and then check whether that fixes it.

> # ./configure --prefix=/usr --with-server-home=/var/spool/pbs
> --enable-maxdefault --disable-drmaa --with-default-server=pbs.pic.es
> --disable-xopen-networking --disable-gui --with-rcp=scp
> --enable-high-availability --enable-libreadline
> --with-tcp-retry-limit=2
> [...]
> # grep TCP src/include/pbs_config.h
> #
> 
> > My question is, did you reinstall the moms as well as the server?
> Nope, only the server.
> 
> > This error can happen from either end, a mom connecting to the
> > server or the server connecting to a mom. (It isn't used from mom
> > to mom, though.) It seems possible that one mom didn't get the
> > update and that's why the problem is happening.
> 
> Ok, will do the update ASAP.
> 
> but it's very strange, because I only see this behaviour when one
> of the blade internal switches goes into a strange state and nodes
> go down, but in a strange way.
> When the node is really down, I don't see the server hanging....
> 
> anyway, I'll come back when the upgrade is done.
> 

Make sure you get pbs_config.h correct before you do the reinstall across the board.
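
As a rough sketch of that check (not anything shipped with TORQUE), a small shell helper can confirm the define is present before you rebuild and push packages out. The macro name TCP_RETRY_LIMIT and the path src/include/pbs_config.h are the ones from this thread; the helper name has_define is made up for illustration.

```shell
#!/bin/sh
# has_define FILE MACRO
# Succeeds (exit 0) if FILE contains a "#define MACRO" line, fails otherwise.
# has_define is a hypothetical helper name, not part of TORQUE.
has_define() {
    # Match "#define MACRO" followed by whitespace or end of line, so that
    # a longer name such as MACRO_SOMETHING does not count as a match.
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$2([[:space:]]|\$)" "$1"
}
```

From the top of the source tree, after running configure, something like "has_define src/include/pbs_config.h TCP_RETRY_LIMIT && echo defined" should print "defined"; if it prints nothing, the option did not make it into the header and the reinstall is not worth doing yet.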

> thanks for the reply and sorry for sending the mail to you and not
> to list, I'm not used to gmail webmail :-) .

It's alright, not a big deal.

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


