[torqueusers] random reboots
jand at uvic.ca
Mon Aug 16 10:39:37 MDT 2010
This may be the wrong place to post this problem, but I am not sure where
else to ask.
I have a cluster of several 8-core nodes running Torque, Open MPI, and
Maui on Debian. The cluster has been running flawlessly for several
months, and I usually run parallel jobs across the whole cluster. Late
last week, one of the nodes started rebooting, seemingly at random. This
only happens while a job is running on it; if it sits idle, it stays up.
The reboots also come completely out of the blue, with nothing in the
Debian logs.
The reboots happen after a job is started. The same code runs on the
other nodes for days without problems.
Has anyone experienced this before and can point me towards possible
causes for this?