[torqueusers] stability again

Alexander Saydakov saydakov at yahoo-inc.com
Fri Sep 29 16:34:26 MDT 2006


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of Garrick Staples
> Sent: Friday, September 29, 2006 3:13 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] stability again
> 
> On Fri, Sep 29, 2006 at 02:58:24PM -0700, Alexander Saydakov alleged:
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > > bounces at supercluster.org] On Behalf Of Garrick Staples
> > > Sent: Friday, September 29, 2006 1:22 PM
> > > To: torqueusers at supercluster.org
> > > Subject: Re: [torqueusers] stability again
> > >
> > > On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> > In this particular case jobs were in exiting state because moms tried to
> > deliver huge error files to faulty NFS, which made nodes unresponsive even
> > to ssh. So I put nodes offline and purged jobs (maybe I did not really need
> > to do so, but I wanted to get rid of them). After several hours admins
> > rebooted those boxes for us, which crashed the server.
> 
> Oh, so you are intentionally breaking it with a purge!  That voids your
> warranty.
> 
> If the node is going to come back with the job, then don't purge the
> jobs.  Just let them wait.

Maybe it was a stupid thing to do, but I did not want those jobs to come
back. They were faulty and hung the nodes by overloading NFS, so they
could do it again. And by the way, I don't believe the decision whether to
start those jobs again should be made by the mom. I think it would be
better to requeue them and observe priorities. I also don't think that we
should allow the server to crash because of deleted jobs, rebooted nodes, or
duplicate requests sent from a mom.
 
And remember the problem we discussed a while ago regarding a node going
down forever? I would love the server to eventually consider the jobs on
that node failed.

> Purging the jobs while they still exist on the nodes is a pretty likely
> cause of this problem.  The attached patch could fix this problem.

Could fix? What does it do? Is it part of 2.1.2?

Thanks for your great help as always. You are very responsive. I really
appreciate it.



