[torqueusers] fault tolerance

Alexander Saydakov saydakov at yahoo-inc.com
Thu Aug 10 18:43:29 MDT 2006


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> Sent: Thursday, August 10, 2006 2:31 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] fault tolerance
> 
> This odd quoting is really hard to read!

Sorry, I think that was Outlook's default configuration.
It should be better now, shouldn't it?

> > This does not sound right to me. Can we use an enforced walltime limit to
> > work around it?
> 
> No, if MS is down, the job will not exit.
> 
> We have plans to correct this, but presently, jobs will never exit while MS
> is down.

That does not sound good to me. Do you mean that mom is the one that is supposed
to enforce walltime? So if mom is down, is there no way to remove the job other
than a manual 'qdel -p'?

> > Yes, it does not know about the reboot, but it should be able to track jobs
> > somehow and realize that those pids are gone or belong to a different
> > process.
> 
> Right, which is why the job exits after pbs_mom starts, the PIDs aren't
> around anymore (which is a pretty shaky assumption to make after a
> reboot because PIDs might be reused).

If mom persisted both the pid and the command line it started, it could check
whether the command line is still the same (though that may also be shaky).
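
To make the idea concrete, here is a rough sketch of such a check. It is only an
illustration, not how pbs_mom works: it assumes a Linux-style /proc filesystem,
and the function names (read_cmdline, still_same_job) and the proc_root
parameter are made up for this example.

```python
import os


def read_cmdline(pid, proc_root="/proc"):
    """Return the argv list of a live process, or None if it cannot be read.

    Assumes a Linux-style /proc where <proc_root>/<pid>/cmdline holds the
    arguments separated by NUL bytes.
    """
    path = os.path.join(proc_root, str(pid), "cmdline")
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None
    if not raw:
        # Kernel threads and zombies expose an empty cmdline.
        return []
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminating NUL
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]


def still_same_job(pid, saved_argv, proc_root="/proc"):
    """True only if the pid exists and its command line matches the saved one."""
    current = read_cmdline(pid, proc_root)
    return current is not None and current == saved_argv
```

After a restart, the saved (pid, argv) pair would be passed to still_same_job();
a False result means the pid is gone or now belongs to some other command. It is
still shaky: a different process started with an identical command line would
pass the check.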



