[torqueusers] need help with 1.2.0p1 snapshot testing

Garrick Staples garrick at usc.edu
Tue Feb 15 16:03:13 MST 2005


On Wed, Feb 16, 2005 at 09:47:12AM +1100, Chris Samuel alleged:
> On Tue, 15 Feb 2005 05:29 pm, Garrick Staples wrote:
> 
> > Queued, yes.  Running, no.  Earlier versions don't save the necessary info
> > to properly preserve the tm state.  The fixes in the new code have as much
> > to do with _saving_ as _recovery_.
> 
> Ah, no I understood that, it's just that (at the moment) only a couple of our 
> users are running with mpiexec and I thought if I could upgrade when they 
> weren't running jobs and not affect the non-mpiexec ones (many are 
> uniprocessor jobs, some happily run for over 3 months) then that would be 
> great.
> 
> So I was wondering if I restarted a mom for one of those, or someone using the 
> MPICH mpirun (using ssh instead of rsh) then would it affect those jobs ?

I don't think that's a good idea.  If you kill a sister while an old MS
(mother superior) is running, the MS will set the job to exiting.  If you kill
an MS that has an old sister, then I think you are OK.  So it might work if you
make sure to restart the MS moms first, before restarting the sisters, but I
haven't tested any of this.
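
A rough sketch of that ordering, in case it helps with planning.  It is
untested against a real cluster and rests on assumptions: the function name
plan_restart and the input format (job id mapped to its host list) are made up
for illustration, and the first host of each job is taken to be the MS.  It
only computes an order; it does not restart anything.  Nodes that are MS for
one job and sister for another are reported separately, since restarting them
would also kill a sister that may still have an old MS.

```python
def plan_restart(jobs):
    """Split nodes into MS nodes, sister-only nodes, and mixed nodes.

    jobs: dict mapping a job id to its list of hostnames.
    Assumption: the first hostname in each list is the mother superior (MS).
    """
    ms_nodes = []
    for hosts in jobs.values():
        if hosts and hosts[0] not in ms_nodes:
            ms_nodes.append(hosts[0])

    sister_nodes = []   # nodes that are never an MS for any job
    mixed = []          # MS for one job, sister for another
    for hosts in jobs.values():
        if not hosts:
            continue
        ms = hosts[0]
        for host in hosts[1:]:
            if host == ms:
                continue  # extra slot on the MS node itself, not a sister
            if host in ms_nodes:
                if host not in mixed:
                    mixed.append(host)
            elif host not in sister_nodes:
                sister_nodes.append(host)
    return ms_nodes, sister_nodes, mixed


if __name__ == "__main__":
    # Hypothetical jobs and hostnames, for illustration only.
    jobs = {
        "101": ["n1", "n2", "n3"],
        "102": ["n2", "n4"],
        "103": ["n5", "n5"],
    }
    ms_nodes, sister_nodes, mixed = plan_restart(jobs)
    print("restart first (MS):", ms_nodes)
    print("restart afterwards (sisters):", sister_nodes)
    print("needs care (MS and sister):", mixed)
```

Anything that lands in the third list would need to wait until the jobs it
serves as a sister have finished, or until their MS has been restarted.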

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California