[torqueusers] OpenPBS to Torque upgrade

Steve Traylen s.traylen at rl.ac.uk
Wed Mar 9 02:12:27 MST 2005


On Tue, Mar 08, 2005 at 03:29:35PM -0700 or thereabouts, David Jackson wrote:
> Chris, Steve,
> 
>   We have found no reason within the code that a change in communication
> protocol would impact queued workload.  As Chris mentioned, you should
> be able to shut down all TORQUE components, rebuild, and then start
> TORQUE.  Queued workload should pick up where you left off.  The next
> time you upgrade from 1.2.x to a later release, you may even be able to
> save most or all running jobs thanks to USC's MOM enhancements. 

Thanks David,

  In our case we don't run MPI jobs at all, only single processor jobs.
  Currently restarting MOMs does not cause any adverse effects that I am
  aware of.  We use the '-p' flag when starting the MOM.

  So in principle I should be able to get away with it even with running
  jobs.
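
  For the record, the sequence I have in mind is roughly the following.
  This is only a sketch; paths, init scripts, and the exact shutdown
  procedure will differ from site to site:

    # Stop the server without killing running jobs.
    qterm -t quick

    # Stop each pbs_mom by your usual means (the running job
    # processes carry on without it).

    # Build and install the new version on the server and all nodes.
    ./configure --disable-rpp
    make
    make install

    # Restart each MOM with -p so it re-attaches to the jobs that
    # were running, then restart the server.
    pbs_mom -p
    pbs_server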

  One more question about '--disable-rpp'. In our case this makes sense
  with 1200 CPUs.  I'm currently building a torque distribution that gets
  deployed at about 80 sites.  What is the disadvantage of using
  '--disable-rpp' on a farm of 2 CPUs?
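
  For reference, the only difference between the two builds would be the
  configure flag (a sketch, assuming an otherwise identical configure
  line):

    # current build: RPP (UDP-based) between server and MOMs, the default
    ./configure --enable-rpp

    # proposed build: TCP throughout
    ./configure --disable-rpp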

    Steve


> 
>   Please let us know what you find.
> 
> Dave
> 
> On Tue, 2005-03-08 at 10:11 +1100, Chris Samuel wrote:
> > On Tue, 8 Mar 2005 02:30 am, Steve Traylen wrote:
> > 
> > > Another migration question: if moving from torque-1.0.1p6 with the
> > > default '--enable-rpp' to torque-1.2.0p1 with '--disable-rpp', what are
> > > the potential pitfalls? I'm less worried about the torque upgrade itself,
> > > but changing the protocol would, I assume, be significant.
> > 
> > Hmm, that's something I'm not sure about.  We've upgraded from time to
> > time, sometimes with the system running and sometimes when imminent
> > electrical power work has forced us to shut the cluster down.  I can't
> > place when that change happened in the grand scheme of things, so I'm not
> > in a position to comment authoritatively.
> > 
> > > Should I drain running jobs, queued jobs, or both? Can I just upgrade
> > > everything, restart everything, and will everything be happy?
> > 
> > I would have thought that the change in the protocol between the MOMs and
> > the server would require restarting all the components, and if you have
> > users running Pete Wyckoff's mpiexec to launch parallel MPI jobs (as we
> > do) then those would certainly be adversely affected by this.
> > 
> > However, we've always upgraded with queued jobs waiting, and the only time 
> > that this has bitten us was with the change to the length of the PBS job ID.
> > 
> > Of course, I have to disclaim all liability for this information, caveat 
> > emptor, batteries not included, if it breaks you get to keep both pieces, 
> > don't blame me if you lose all your queued jobs or your cluster develops 
> > emergent behaviour and takes over the world...
> > 
> > In short, the SuperCluster developers would be more helpful than me on 
> > this. ;-)
> > 
> > Good luck!
> > Chris

-- 
Steve Traylen
s.traylen at rl.ac.uk
http://www.gridpp.ac.uk/

