[torqueusers] Re: mom segfault in new diag code

Chris Samuel csamuel at vpac.org
Sun Oct 31 16:04:41 MST 2004


On Mon, 1 Nov 2004 09:47 am, Garrick Staples wrote:

> And you can reliably not break jobs?   If I went through the entire cluster
> and restarted every mom, I know I'll lose half the jobs.

I forgot to say that when we've had to do that (we'd lost a mom somewhere
and needed to restart everything to get it back in sync, before the fix in
p3), it hasn't been a problem for us.

Of course, it may be that we are running jobs that are not so susceptible to
this; it's hard to know because our userbase is so diverse.

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
