[torqueusers] random reboots

Brad Cavanagh brad.cavanagh at gmail.com
Mon Aug 16 10:40:52 MDT 2010


Hi Jan,

Random problems like this usually point to bad hardware, most likely
RAM. Do you see the same problems when you run the same job on the
node manually (i.e. log in to the node and run it directly, instead of
sending it through your queue scheduler)?
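
If it helps, here is a rough sketch of one way to do that manual run while
leaving a trail on disk. It starts the job directly (no Torque involved) and
appends a timestamped heartbeat to a log file, syncing each line to disk so
the last entry should survive a sudden reboot. The function names, the log
file name, and the 5 second interval are all made up for illustration; they
are not part of Torque or of your setup.

```python
import os
import subprocess
import time


def log_line(path, message):
    """Append a timestamped line and force it to disk right away."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as handle:
        handle.write(f"{stamp} {message}\n")
        handle.flush()
        # fsync asks the OS to push the data to disk, so the line is
        # not lost in a page cache if the node reboots without warning.
        os.fsync(handle.fileno())


def run_with_heartbeat(command, log_path, interval=5.0):
    """Run `command` directly and log a heartbeat until it exits.

    `command` is a list such as ["./my_job"] (hypothetical job name).
    Returns the exit code of the job.
    """
    log_line(log_path, "start: " + " ".join(command))
    process = subprocess.Popen(command)
    while True:
        try:
            code = process.wait(timeout=interval)
            break
        except subprocess.TimeoutExpired:
            # Job is still alive; record that we got this far.
            log_line(log_path, "still running")
    log_line(log_path, f"exit code {code}")
    return code
```

For example, calling run_with_heartbeat(["./my_job"], "heartbeat.log") on the
suspect node and then reading heartbeat.log after a reboot would show roughly
how long the job ran before the node went down. This only narrows down the
timing; it does not test the hardware itself.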

Brad.

On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer <jand at uvic.ca> wrote:
> Hi all,
>
> This may be the wrong place to post this problem but I am not sure where to
> start.
>
> I have a cluster of several 8-core nodes on which I run Torque, Open MPI, and
> MAUI on Debian. The cluster has been running flawlessly for several months and
> I usually run parallel jobs across the whole cluster. Late last week, I
> started having problems with one of the nodes rebooting at seemingly random
> times. This only happens when I am running a job on it. If it sits idle, it
> stays alive without reboots. The reboots also come completely out of the blue,
> without any signs in the Debian logs.
>
> The reboots happen after a job is started. The same code runs on the other
> nodes without problem for days.
>
> Has anyone experienced this before and can point me towards possible causes
> for this?
>
> Thanks, Jan

