[torqueusers] need help with 1.2.0p1 snapshot testing

Garrick Staples garrick at usc.edu
Mon Feb 14 18:13:18 MST 2005


The latest snapshot, 1.2.0p1-snap.1108426118, has folded in some big changes.
I've been banging on these patches for a few weeks and everything seems pretty
solid to me, but it could really use some wider testing.

Job starting got a major overhaul after 1.1.0p4 and we saw a lot of regressions
in 1.1.0p5.  This latest snapshot fixes up a few more bugs.  Job starting
should be rock solid now.  Please test!

There's a new pbs_mom config parameter called $jobstartblocktime that defines
how long pbs_server will initially block while waiting for a job to start.  It
defaults to 5 seconds, but we'd like people to test lower values like 1 or 0.
The lower the value, the better pbs_server should respond to client requests
(like qstat) while starting up jobs.  If 0 doesn't cause any problems, it will
be the default in future releases.  Please test!
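
For anyone trying this out, a minimal sketch of the entry, assuming the usual
"$name value" syntax of the pbs_mom config file (the file location depends on
your install, and the value 1 is just one of the test values suggested above):

```
# pbs_mom config file (typically mom_priv/config on each compute node)
# Assumed syntax: parameter name, whitespace, value in seconds.
$jobstartblocktime 1
```

pbs_mom has to reread its config before the new value takes effect.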

Another shaky area is restarting pbs_mom daemons.  It should now be
possible to restart any daemon at any time without breaking jobs.  pbsdsh has
been enhanced to live in this world of restarting moms.  I can already tell
you that mpiexec won't deal with it properly.  I'm worried about these changes
affecting the recoverability of failing jobs.  Please test!

There are other changes too, but those three really need more testing.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California