[torqueusers] torque-2.5.3-1cri.x86_64 hang when a node falls

Arnau Bria arnaubria at pic.es
Tue Mar 29 14:08:16 MDT 2011


Hi David,

> I think you have it configured correctly. If you are doubting, you
> could double check to make sure that src/include/pbs_config.h has
> TCP_RETRY_LIMIT defined as you expect.

It's not there...

# ./configure --prefix=/usr --with-server-home=/var/spool/pbs  --enable-maxdefault --disable-drmaa --with-default-server=pbs.pic.es --disable-xopen-networking  --disable-gui --with-rcp=scp --enable-high-availability --enable-libreadline --with-tcp-retry-limit=2
[...]
# grep  TCP src/include/pbs_config.h
#
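
For a stricter check than the plain grep above, here is a minimal sketch that looks for an actual #define line rather than any line containing "TCP". has_define is a hypothetical helper name (it is not part of TORQUE), and the sketch assumes configure would write the option into the header as an ordinary "#define TCP_RETRY_LIMIT ..." line.

```shell
# Hypothetical helper (not part of TORQUE): report whether a macro is
# #defined in a C header. Usage: has_define MACRO HEADER
has_define() {
    macro=$1
    header=$2
    # Match "#define MACRO", allowing whitespace around '#', and require
    # whitespace or end of line after the name so that longer names
    # sharing the same prefix do not match.
    if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${macro}([[:space:]]|\$)" "$header"; then
        echo "$macro is defined in $header"
    else
        echo "$macro is NOT defined in $header"
        return 1
    fi
}
```

Run from the top of the source tree after configure, it would be called as: has_define TCP_RETRY_LIMIT src/include/pbs_config.h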

> My question is, did you reinstall the moms as well as the server?
No, only the server.

> This error can happen from either end, a mom connecting to the server
> or the server connecting to a mom. (It isn't used from mom to mom
> though). It seems possible that maybe one mom didn't get the update
> and that's why the problem is happening.

OK, I will do the update ASAP.

But it's very strange, because I only see this behaviour when one
of the blade internal switches goes into a strange state and the nodes
go down, but in an odd way.
When a node is really down, I don't see the server hanging...

Anyway, I'll report back once the upgrade is done.

Thanks for the reply, and sorry for sending the mail to you and not
to the list; I'm not used to Gmail webmail :-)

Cheers,
Arnau

