[torqueusers] random reboots

akshar bhosale akshar.bhosale at gmail.com
Mon Aug 16 12:32:49 MDT 2010


Is ASR (automatic system recovery) enabled?

On Mon, Aug 16, 2010 at 10:10 PM, Brad Cavanagh <brad.cavanagh at gmail.com> wrote:

> Hi Jan,
>
> Random problems like this usually point to bad hardware, more than
> likely RAM. Do you see the same problems when you run the same job on
> the node manually (i.e. log in to the node and run it, instead of
> sending it through your queue scheduler)?
>
> Brad.
>
> On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer <jand at uvic.ca> wrote:
> > Hi all,
> >
> > This may be the wrong place to post this problem, but I am not sure where
> > to start.
> >
> > I have a cluster of several 8-core nodes on which I run Torque, Open MPI,
> > and Maui on Debian. The cluster has been running flawlessly for several
> > months, and I usually run parallel jobs across the whole cluster. Late last
> > week, I started having problems with one of the nodes rebooting at
> > seemingly random times. This only happens when I am running a job on it. If
> > it sits idle, it stays alive without reboots. The reboots also come
> > completely out of the blue, without any signs in the Debian logs.
> >
> > The reboots happen after a job is started. The same code runs on the
> > other nodes without problem for days.
> >
> > Has anyone experienced this before and can point me towards possible
> > causes for this?
> >
> > Thanks, Jan
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers