[torqueusers] Re: mom segfault in new diag code

Garrick Staples garrick at usc.edu
Sun Oct 31 16:08:12 MST 2004


On Mon, Nov 01, 2004 at 09:58:32AM +1100, Chris Samuel alleged:
> On Mon, 1 Nov 2004 09:47 am, Garrick Staples wrote:
> 
> > > We *always* run the moms with the -p flag for just this reason. :-)
> >
> > And you can reliably not break jobs? If I went through the entire cluster
> > and restarted every mom, I know I'd lose half the jobs.
> 
> Ahh, our PBS script's method for stopping the mom on compute nodes is:
> 
>  kill -9
> 
> That way it doesn't get time to think about what it's going to do to the jobs 
> running on its node.. ;-)

I must admit, I haven't tested this with kill -9!

Recipe for success: never let pbs_mom die gracefully and always start with -p?
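
For anyone who wants to try that recipe, here is a rough, untested sketch of
it as a shell function. It assumes pkill is available on the node and that
pbs_mom is on the PATH. DRY_RUN and MOM_BIN are made-up knobs for this
example only, not TORQUE settings. It defaults to a dry run that just prints
the commands, since I haven't verified the kill -9 behaviour myself.

```shell
#!/bin/sh
# Sketch only: stop pbs_mom abruptly, then restart it with -p.
# DRY_RUN and MOM_BIN are hypothetical knobs for this example, not TORQUE options.
DRY_RUN="${DRY_RUN:-1}"
MOM_BIN="${MOM_BIN:-pbs_mom}"

run() {
    # In dry-run mode, print the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_mom() {
    # SIGKILL cannot be caught, so the mom gets no chance to act on its jobs.
    run pkill -9 -x pbs_mom
    # Start the new mom with -p so it leaves already-running jobs alone.
    run "$MOM_BIN" -p
}

restart_mom
```

Set DRY_RUN=0 to actually run the two commands. Try it on a single node with
a throwaway job before going anywhere near the whole cluster.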


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California