[torqueusers] random reboots

Jan Dettmer jand at uvic.ca
Mon Aug 16 10:39:37 MDT 2010


Hi all,

This may be the wrong place to post this problem but I am not sure where 
to start.

I have a cluster of several 8-core nodes running Debian, with Torque, 
Open MPI, and Maui. The cluster has been running flawlessly for several 
months, and I usually run parallel jobs across the whole cluster. Late 
last week, one of the nodes started rebooting at seemingly random 
times. This only happens when I am running a job on it. If it sits 
idle, it stays up without reboots. The reboots also come completely 
out of the blue, with no warning signs in the Debian logs.

The reboots happen some time after a job has started. The same code 
runs on the other nodes for days without problems.

Has anyone experienced this before, and can you point me towards 
possible causes?

Thanks, Jan


