[torqueusers] stability again
'Garrick Staples'
garrick at usc.edu
Fri Sep 29 16:53:15 MDT 2006
On Fri, Sep 29, 2006 at 03:34:26PM -0700, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > bounces at supercluster.org] On Behalf Of Garrick Staples
> > Sent: Friday, September 29, 2006 3:13 PM
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] stability again
> >
> > On Fri, Sep 29, 2006 at 02:58:24PM -0700, Alexander Saydakov alleged:
> > > > -----Original Message-----
> > > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > > > bounces at supercluster.org] On Behalf Of Garrick Staples
> > > > Sent: Friday, September 29, 2006 1:22 PM
> > > > To: torqueusers at supercluster.org
> > > > Subject: Re: [torqueusers] stability again
> > > >
> > > > On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> > > In this particular case jobs were in exiting state because moms tried to
> > > deliver huge error files to faulty NFS, which made nodes unresponsive even
> > > to ssh. So I put nodes offline and purged jobs (maybe I did not really need
> > > to do so, but I wanted to get rid of them). After several hours admins
> > > rebooted those boxes for us, which crashed the server.
> >
> > Oh, so you are intentionally breaking it with a purge! That voids your
> > warranty.
> >
> > If the node is going to come back with the job, then don't purge the
> > jobs. Just let them wait.
>
> Maybe it was a stupid thing to do, but I did not want those jobs to come
> back. They were faulty and made nodes stuck by overloading NFS, so they
> could do it again. And by the way, I don't believe that decision whether to
> start those jobs again or not should be made by mom. I think it would be
> better to requeue it and observe priorities. And also I don't think that we
> should allow server to crash because of deleting jobs, rebooting nodes or
> sending duplicate requests from mom.
Of course we don't want pbs_server to crash. Any and all crashes will
be fixed.
> And remember that problem we discussed a while ago regarding a node going
> down forever? I would love the server to consider jobs on that node failed
> eventually.
We are making progress on allowing this to happen sanely. We're just
not there yet.
> > Purging the jobs while they still exist on the nodes is a pretty likely
> > cause of this problem. The attached patch could fix this problem.
>
> Could fix? What does it do? Is it a part of 2.1.2?
"could" because I haven't actually duplicated this specific problem and
tested it. The patch is against trunk but it should apply to 2.1.2.
> Thanks for your great help as always. You are very responsive. I really
> appreciate it.
I live to serve :)
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California