[torqueusers] torque-2.5.3-1cri.x86_64 hang when a node falls
David Beer
dbeer at adaptivecomputing.com
Tue Mar 29 15:20:56 MDT 2011
----- Original Message -----
> Hi David,
>
> > I think you have it configured correctly. If you are doubting, you
> > could double check to make sure that src/include/pbs_config.h has
> > TCP_RETRY_LIMIT defined as you expect.
>
> it's not there...
>
If it's not there, then you definitely know that it isn't set up correctly. Try running configure again and then check whether that fixes it.
> # ./configure --prefix=/usr --with-server-home=/var/spool/pbs
> --enable-maxdefault --disable-drmaa --with-default-server=pbs.pic.es
> --disable-xopen-networking --disable-gui --with-rcp=scp
> --enable-high-availability --enable-libreadline
> --with-tcp-retry-limit=2
> [...]
> # grep TCP src/include/pbs_config.h
> #
>
> > My question is, did you reinstall the moms as well as the server?
> Nope, only the server.
>
> > This error can happen from either end, a mom connecting to the
> > server
> > or the server connecting to a mom. (It isn't used from mom to mom
> > though). It seems possible that maybe one mom didn't get the update
> > and that's why the problem is happening.
>
> Ok, will do the update ASAP.
>
> but it's very strange, because I only see this odd behaviour when one
> of the blade internal switches goes into a strange state and nodes
> become
> down but in a strange way.
> When the node is really down, I don't see the server hanging....
>
> anyway, I'll come back when the upgrade is done.
>
Make sure you get pbs_config.h correct before you do the reinstall across the board.
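Something along these lines could be run from the top of the source tree after configure finishes, before building anything. It is only a rough sketch: check_retry_limit is a made-up helper name, and it only looks for the string TCP_RETRY_LIMIT in the header; it does not verify the value or the exact form of the define.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): report whether the generated
# header mentions TCP_RETRY_LIMIT at all.
# Returns 0 if found, 1 if not found, 2 if the header file is missing.
check_retry_limit() {
    header="${1:-src/include/pbs_config.h}"
    if [ ! -f "$header" ]; then
        echo "missing: $header"
        return 2
    fi
    # Print matching lines with line numbers so the value can be eyeballed.
    if grep -n 'TCP_RETRY_LIMIT' "$header"; then
        return 0
    fi
    echo "TCP_RETRY_LIMIT not found in $header"
    return 1
}
```

For example, "check_retry_limit || echo 'rerun configure before building'" would flag the situation you are seeing now, where the grep comes back empty.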
> thanks for the reply and sorry for sending the mail to you and not
> to list, I'm not used to gmail webmail :-) .
It's all right, not a big deal.
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606