[torqueusers] random reboots

Joshua Bernstein jbernstein at penguincomputing.com
Mon Aug 16 12:51:22 MDT 2010


Hi Jan,

I've got to agree with Brad here. This sounds like some memory going 
bad. You might want to put the machine through its paces with memtest86:

http://www.memtest.org/

Also, it might help to add the 'noreboot' option to the node's kernel 
command line and attach a console to the display port to see if you can 
catch a kernel dump or backtrace.
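
After rebooting with a changed command line, it is worth confirming that the option actually reached the running kernel. The following is a minimal Python sketch, not a definitive tool. It assumes a Linux node where /proc/cmdline holds the command line of the running kernel. The function names parse_cmdline, has_option, and running_cmdline are hypothetical, and the parser does not handle quoted values that contain spaces.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags (no '=') map to None. Quoted values containing
    spaces are not handled in this sketch.
    """
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def has_option(cmdline: str, name: str) -> bool:
    """Return True if the option appears as a bare flag or as name=value."""
    return name in parse_cmdline(cmdline)


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()
```

On the affected node, has_option(running_cmdline(), "noreboot") would then report whether the option named above was picked up at boot.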

-Joshua Bernstein
Penguin Computing

Brad Cavanagh wrote:
> Hi Jan,
> 
> Random problems like this usually point to bad hardware, more than
> likely RAM. Do you see the same problems when you run the same job on
> the node manually (i.e. log in to the node and run it, instead of
> sending it through your queue scheduler)?
> 
> Brad.
> 
> On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer <jand at uvic.ca> wrote:
>> Hi all,
>>
>> This may be the wrong place to post this problem but I am not sure where to
>> start.
>>
>> I have a cluster of several 8-core nodes on which I run TORQUE, Open MPI,
>> and Maui on Debian. The cluster has been running flawlessly for several
>> months and I usually run parallel jobs across the whole cluster. Late last
>> week, I started having problems with one of the nodes rebooting at what
>> seem to be random times. This only happens when I am running a job on it.
>> If it sits idle, it stays alive without reboots. The reboots are also
>> completely out of the blue, without any signs in the Debian logs.
>>
>> The reboots happen after a job is started. The same code runs on the other
>> nodes without problem for days.
>>
>> Has anyone experienced this before and can point me towards possible causes
>> for this?
>>
>> Thanks, Jan
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
