[torqueusers] random reboots
akshar.bhosale at gmail.com
Mon Aug 16 12:32:49 MDT 2010
is ASR (automatic system recovery)
On Mon, Aug 16, 2010 at 10:10 PM, Brad Cavanagh <brad.cavanagh at gmail.com> wrote:
> Hi Jan,
> Random problems like this usually point to bad hardware, more than
> likely RAM. Do you see the same problems when you run the same job on
> the node manually (i.e. login to the node and run it, instead of
> sending it through your queue scheduler)?
> On Mon, Aug 16, 2010 at 9:39 AM, Jan Dettmer <jand at uvic.ca> wrote:
> > Hi all,
> > This may be the wrong place to post this problem, but I am not sure where to
> > start.
> > I have a cluster of several 8-core nodes on which I run Torque, Open MPI, and
> > Maui under Debian. The cluster has been running flawlessly for several months,
> > and I usually run parallel jobs across the whole cluster. Late last week, I
> > started having problems with one of the nodes rebooting at what seems to be
> > random. This only happens when I am running a job on it. If it sits idle, it
> > stays alive without reboots. The reboots are also completely out of the blue,
> > without any signs in the Debian logs.
> > The reboots happen after a job is started. The same code runs on the other
> > nodes without problems for days.
> > Has anyone experienced this before and can point me towards possible causes
> > for this?
> > Thanks, Jan
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers