[torqueusers] stability again

'Garrick Staples' garrick at usc.edu
Fri Sep 29 16:53:15 MDT 2006


On Fri, Sep 29, 2006 at 03:34:26PM -0700, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > bounces at supercluster.org] On Behalf Of Garrick Staples
> > Sent: Friday, September 29, 2006 3:13 PM
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] stability again
> > 
> > On Fri, Sep 29, 2006 at 02:58:24PM -0700, Alexander Saydakov alleged:
> > > In this particular case jobs were in exiting state because moms tried to
> > > deliver huge error files to faulty NFS, which made nodes unresponsive even
> > > to ssh. So I put nodes offline and purged jobs (maybe I did not really need
> > > to do so, but I wanted to get rid of them). After several hours admins
> > > rebooted those boxes for us, which crashed the server.
> > 
> > Oh, so you are intentionally breaking it with a purge!  That voids your
> > warranty.
> > 
> > If the node is going to come back with the job, then don't purge the
> > jobs.  Just let them wait.
> 
> Maybe it was a stupid thing to do, but I did not want those jobs to come
> back. They were faulty and got the nodes stuck by overloading NFS, so they
> could do it again. And by the way, I don't believe the decision whether to
> start those jobs again should be made by mom. I think it would be better
> to requeue them and observe priorities. I also don't think we should allow
> the server to crash because of deleted jobs, rebooted nodes, or duplicate
> requests sent from mom.

Of course we don't want pbs_server to crash.  Any and all crashes will
be fixed.
 

> And remember that problem we discussed a while ago regarding a node going
> down forever? I would love the server to consider jobs on that node failed
> eventually.

We are making progress on allowing this to happen sanely.  We're just
not there yet.
 

> > Purging the jobs while they still exist on the nodes is a pretty likely
> > cause of this problem.  The attached patch could fix this problem.
> 
> Could fix? What does it do? Is it a part of 2.1.2?

"could" because I haven't actually duplicated this specific problem and
tested it.  The patch is against trunk but it should apply to 2.1.2.

 
> Thanks for your great help as always. You are very responsive. I really
> appreciate it.

I live to serve :)

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California