[torqueusers] possible protocol problem.

Chris Samuel csamuel at vpac.org
Tue Nov 16 15:09:04 MST 2004


On Wed, 17 Nov 2004 12:59 am, Chris Johnson wrote:

>      Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all.

This sounds like a kernel problem we used to have before we dumped the RH 7.3 
kernels and went to a stock 2.4.26 instead.

It appeared to be a well-known bug in which the OOM killer fails to act in
time and the node deadlocks, as you describe. Disabling the OOM killer takes
you back to the kernel killing processes whose malloc fails. Not pretty, but
at least you don't have to power-cycle the node.

I'm surprised that you're seeing this on FC2 though. Is this with a 2.6
kernel?

> But this undetected down node problem pretty much squashes
> that idea.   
> 
> Has anybody seen this problem? 

Not since we disabled the OOM killer in our kernels, I think. :-(

I guess you could simulate it by killing the mom on a node and making netcat
listen for connections on its ports instead?
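
If netcat isn't handy, the same idea can be sketched in Python. This is only
an illustration, not anything TORQUE ships: start_silent_listener and probe
are hypothetical names, and the listener simply accepts TCP connections and
then never reads or replies, which is roughly how a comatose node looks from
the outside (the port answers, but nothing ever comes back).

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections but never read or reply (hypothetical helper).

    port=0 lets the OS pick a free port; the chosen port is returned.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)
    held = []  # keep accepted sockets open so clients never see EOF

    def accept_loop():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def probe(host, port, timeout=1.0):
    """Classify a TCP endpoint (hypothetical helper).

    Returns 'refused', 'unreachable', 'silent', 'closed' or 'responded'.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"      # nothing is listening on that port
    except socket.timeout:
        return "unreachable"  # the connect itself timed out
    with conn:
        conn.settimeout(timeout)
        try:
            data = conn.recv(1)
        except socket.timeout:
            return "silent"   # connected, but the peer never spoke
        return "responded" if data else "closed"


if __name__ == "__main__":
    server, port = start_silent_listener()
    print(port, probe("127.0.0.1", port, timeout=0.5))
```

On a real test node you would stop pbs_mom first and bind the listener to
the ports it normally uses (I believe 15002 and 15003 by default, but check
your own configuration), then watch how the server reacts to a mom that
accepts connections and says nothing.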

cheers,
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
