[torqueusers] fault tolerance
garrick at clusterresources.com
Thu Aug 10 21:34:07 MDT 2006
On Thu, Aug 10, 2006 at 05:43:29PM -0700, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > Sent: Thursday, August 10, 2006 2:31 PM
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] fault tolerance
> > This odd quoting is really hard to read!
> Sorry, I think it was a default configuration in Outlook.
> Now it must be better, mustn't it?
Much more readable, thank you :)
> > > This does not sound right to me. Can we use an enforced walltime limit
> > > to work around it?
> > No, if MS is down, the job will not exit.
> > We have plans to correct this, but presently, jobs will never exit while
> > MS is down.
> Oh, it does not sound very good to me. Do you mean that mom is supposed to
> enforce walltime? So if mom is down, there is no other way besides manual
> 'qdel -p'?
Correct. MS keeps track of walltime, not pbs_server. But either way,
the job can't exit until pbs_server gets a jobobit request from MS.
> > > Yes, it does not know about the reboot, but it should be able to track
> > > jobs somehow and realize that those pids are gone or belong to a different
> > > process.
> > Right, which is why the job exits after pbs_mom starts: the PIDs aren't
> > around anymore (which is a pretty shaky assumption to make after a
> > reboot because PIDs might be reused).
> If mom persisted both the pid and the command line it started, then it could
> check whether the command line is still the same (which may also be shaky).
Which is why pbs_mom is started at boot without any args. "It is
assumed that on reboot, all processes have been killed."
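
For illustration only, the pid-plus-command-line check suggested above could
look roughly like the sketch below. This is not how pbs_mom works; the
function names are made up for the example, and it assumes a Linux-style
/proc filesystem where /proc/<pid>/cmdline holds the NUL-separated argv.

```python
import os


def read_cmdline(pid, proc_root="/proc"):
    """Return the argv list of `pid`, or None if no such process exists.

    Hypothetical helper: relies on the Linux /proc/<pid>/cmdline file.
    """
    path = os.path.join(proc_root, str(pid), "cmdline")
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except OSError:
        # No such pid (or it is unreadable): treat the process as gone.
        return None
    # Arguments are separated by NUL bytes; drop the empty trailing piece.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def job_still_running(saved_pid, saved_cmdline, proc_root="/proc"):
    """True only if `saved_pid` exists and still has the saved command line."""
    current = read_cmdline(saved_pid, proc_root)
    return current is not None and current == list(saved_cmdline)
```

As noted in the thread, this is still shaky: a reused pid could be running
the very same command. Also comparing the process start time (field 22 of
/proc/<pid>/stat on Linux) would narrow that window, but would not make a
reboot any easier to reason about.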