[torqueusers] random reboots
jbernstein at penguincomputing.com
Mon Aug 16 12:51:22 MDT 2010
I've got to agree with Brad here. This sounds like some memory going
bad. You might want to put the machine through its paces with memtest86:
Also, it might help to add the 'noreboot' option to the node's kernel
command line and attach a console to the display port to see if you can
catch a kernel dump or backtrace.
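If taking the node down for memtest86 isn't convenient right away, a rough
user-space check along these lines might be a useful first step. This is only
a sketch: the function names, buffer sizes, and log path are made up for
illustration, and a user-space program can only touch whatever memory the
kernel hands it, so a clean run here does not clear the RAM. Each log line is
forced to disk before the next pass so that the last entry should survive a
sudden reboot.

```python
import os
import time


def log_line(path, message):
    """Append a timestamped line and force it to disk before continuing."""
    with open(path, "a") as f:
        f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
        f.flush()
        os.fsync(f.fileno())


def check_memory(total_mb=64, chunk_mb=16,
                 patterns=(0x00, 0xFF, 0xAA, 0x55), log_path=None):
    """Fill buffers with each byte pattern, read them back, count bad chunks.

    All sizes and patterns are illustrative defaults, not recommendations.
    """
    chunk_size = chunk_mb * 1024 * 1024
    n_chunks = max(1, total_mb // chunk_mb)
    bad_chunks = 0
    for pattern in patterns:
        if log_path:
            log_line(log_path, f"starting pattern 0x{pattern:02X}")
        expected = bytes([pattern]) * chunk_size
        # Hold every chunk at once so the whole test size is resident together.
        buffers = [bytearray(expected) for _ in range(n_chunks)]
        bad_chunks += sum(1 for buf in buffers if buf != expected)
    if log_path:
        log_line(log_path, f"finished, bad chunks: {bad_chunks}")
    return bad_chunks
```

Calling something like check_memory(total_mb=4096, log_path="memcheck.log")
on the suspect node and on a known-good node would give a crude comparison.
If the node reboots partway through, the last line in the log shows how far
it got.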
Brad Cavanagh wrote:
> Hi Jan,
> Random problems like this usually point to bad hardware, more than
> likely RAM. Do you see the same problems when you run the same job on
> the node manually (i.e. log in to the node and run it, instead of
> sending it through your queue scheduler)?
> On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer <jand at uvic.ca> wrote:
>> Hi all,
>> This may be the wrong place to post this problem but I am not sure where to
>> I have a cluster of several 8-core nodes on which I run Torque, Open MPI,
>> and Maui on Debian. The cluster has been running flawlessly for several
>> months and I usually run parallel jobs across the whole cluster. Late last
>> week, I started having problems with one of the nodes rebooting, seemingly
>> at random. This only happens when I am running a job on it. If it sits
>> idle, it stays alive without reboots. The reboots are also completely out
>> of the blue, without any signs in the Debian logs.
>> The reboots happen after a job is started. The same code runs on the other
>> nodes without problem for days.
>> Has anyone experienced this before and can point me towards possible causes
>> for this?
>> Thanks, Jan
>> torqueusers mailing list
>> torqueusers at supercluster.org