[torqueusers] possible protocol problem.
Chris Samuel
csamuel at vpac.org
Tue Nov 16 15:09:04 MST 2004
On Wed, 17 Nov 2004 12:59 am, Chris Johnson wrote:
> Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all.
This sounds like a kernel problem we used to have before we dumped the RH 7.3
kernels and went to a stock 2.4.26 instead.
It appeared to be a well-known bug in the OOM killer: it fails to take action
in time, and the node deadlocks as you describe. Disabling the OOM killer
returns you to the kernel killing processes that fail on a malloc. Not
pretty, but at least you don't have to power-cycle the node.
I'm surprised that you're seeing this on FC2 though. Is this with a 2.6
kernel?
> But this undetected down node problem pretty much squashes
> that idea.
>
> Has anybody seen this problem?
Not since we disabled the OOM killer in our kernels, I think. :-(
I guess you could simulate it by killing the mom on a node and making netcat
listen for connections on its ports instead?
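
If netcat isn't handy, a rough Python stand-in for the same trick might look
like the sketch below. It accepts TCP connections and then never reads or
replies, which is roughly what a wedged node looks like from the server side.
The function name start_silent_listener is made up for illustration, and the
port numbers mentioned afterwards are an assumption about the mom defaults,
so check them against your own configuration.

```python
import socket
import threading


def start_silent_listener(port, host="0.0.0.0"):
    """Accept TCP connections on (host, port) but never read or reply.

    Hypothetical helper: it imitates a daemon that is still reachable at
    the TCP level but has stopped doing any work.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(16)
    held = []  # keep accepted sockets referenced so they are never closed

    def accept_loop():
        while True:
            try:
                conn, _addr = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # hold the connection open and stay silent

    # Daemon thread: it will not keep the process alive on its own.
    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, held
```

With pbs_mom stopped on the test node, calling start_silent_listener(15002)
and start_silent_listener(15003) (assumed mom ports) and keeping the process
running should let you watch how pbs_server treats a node that connects but
never answers.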
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia