[torqueusers] How to upgrade torque safely?

Weiguang Chen chenweiguang82 at gmail.com
Thu Apr 24 23:56:44 MDT 2008


Thank you all very much, I will try it.

On Fri, Apr 25, 2008 at 6:51 AM, Steve Young <chemadm at hamilton.edu> wrote:

> Hi, I've upgraded my pbs_server and never lost a job. qterm -t quick will
> kill the server but allow the jobs to continue to run on the nodes. Then I
> restart with the new server binary, the nodes report back in, and
> everything appears to continue running. Server done. Now for the nodes I do
> as Jerry suggests: create an admin reservation on each node at a time when
> the node will be finished with its current job (to prevent more jobs from
> running on it). Then I restart with the new pbs_mom and remove the
> reservation to allow jobs to start running on the node again. So far, nodes
> running older versions of pbs_mom continue to work with the new server
> binary until each one can be restarted. Probably not the best way to do it,
> but it was worth a shot to see if it could be done without losing anything.
> As far as I can tell it seems to work.
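>
> In commands the whole sequence looks roughly like this, assuming Maui's
> setres/releaseres for the admin reservation and init scripts for the
> daemons (adjust the names for your site):
>
>     qterm -t quick                    # stop pbs_server, leave running jobs alone
>     /etc/init.d/pbs_server start      # restart with the new server binary
>
>     # then, one node at a time, around when its current job will finish:
>     setres -d 4:00:00 node05          # admin reservation keeps new jobs off the node
>     /etc/init.d/pbs_mom restart       # restart the mom with the new binary
>     releaseres <reservation id>       # let jobs run on the node again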
>
>
> -Steve
>
> On Apr 24, 2008, at 10:46 AM, Jerry Smith wrote:
>
> Jim,
>
> Nice write-up.  Here are a few things we do differently, to offer some
> insight from another site.
>
> We build our scheduling systems in NFS space, which allows us to change
> versions quickly via symlinks.
> Example:
> /apps is an NFS mount
> /apps/torque-2.1.0p8
> /apps/torque-2.2.0
>
> We then symlink /apps/torque to one of those versions and have all of our
> init scripts point to /apps/torque.  If the newer version fails, then we
> just switch the symlink back to the prior release and keep on computing.
> We also symlink $PBS_HOME/mom_priv to NFS space, so if we make changes to
> the config/prologue/epilogue they propagate immediately to ALL nodes.
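>
> The switch itself is just a symlink flip, something like this (the mom_priv
> paths here are only an example):
>
>     ln -sfn /apps/torque-2.2.0 /apps/torque      # point init scripts at the new build
>     ln -sfn /apps/torque-2.1.0p8 /apps/torque    # or roll back to the prior release
>
>     mv $PBS_HOME/mom_priv $PBS_HOME/mom_priv.local
>     ln -s /apps/torque-config/mom_priv $PBS_HOME/mom_priv   # shared from NFS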
>
> A question about qrerun for you.  Do your users ever comment on having
> their jobs requeued?  Our user community does not like that option, as they
> may have a job that has run for many days, and if I qrerun it, it starts right
> back at the beginning, possibly overwriting large amounts of data.  They
> would rather have us kill the job and let them resubmit, most times using a
> restart file in conjunction with the "depend" qsub option, i.e. run job5
> only after job4 finishes (using the restart files created by job4).
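>
> The dependency part is just qsub's depend attribute, for example:
>
>     JOB4=$(qsub job4.pbs)                    # job4 writes its restart files
>     qsub -W depend=afterok:$JOB4 job5.pbs    # job5 starts only after job4 exits OK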
>
> For us, to protect the workload that is currently queued/idle, we use
> Moab (the same can be done with Maui): instead of stopping the queues,
> we create a system-wide reservation for an admin user, who then runs a suite
> of test jobs.
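>
> With Maui that can be as simple as something like this (the admin user name
> is only an example; Moab's mrsvctl does the equivalent):
>
>     setres -u admin ALL          # system-wide reservation: only admin's jobs start
>     # admin runs the test suite, then:
>     releaseres <reservation id>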
>
> Jerry Smith
>
>
> James J Coyle wrote:
>
> Weiguang,
>
>    I'm very cautious and want to keep users aware of a major change
> in the system, which they appreciate.  Similar procedures take place
> for any major system software.  I've never had to roll back torque, but
> I have had to for other software whose new version did not work.
>
>    I do the following:
>
> Prep work:
> -------------
> 1) Install the new version on a 2-node test system to make sure the
>    new version works OK.  (Two old Pentium IIIs in my office.)
>
> 2) Do the configure and make on the main system (in a new directory, not over
>   the old torque, in case I need to roll back) but hold off on make install.
>   (The install will be quick later.)
>
> 3) Create a testq that users can't submit to, but that I can submit a job to,
>     to check that the new torque is up and OK.  (Steps 2 and 3 are sketched
>     below, after this list.)
>
> 4) Announce the upgrade with /etc/motd and/or an email to
>     the group(s) involved. I do this 1 day before the longest queue
>     walltime limit so users can decide if they want a job to be running
>     when I upgrade.
>     (Most don't care but appreciate being kept in the loop.)
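>
> For steps 2) and 3), something like the following (the prefix path and
> queue attributes are only examples):
>
>     # 2) build the new version in its own tree, defer make install
>     cd /usr/local/src/torque-2.2.0
>     ./configure --prefix=/usr/local/torque && make
>
>     # 3) a test queue only the administrator may submit to, not yet started
>     qmgr -c "create queue testq queue_type=execution"
>     qmgr -c "set queue testq acl_user_enable = true"
>     qmgr -c "set queue testq acl_users = root"
>     qmgr -c "set queue testq enabled = true"
>     qmgr -c "set queue testq started = false"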
>
> On day of upgrade:
> -------------------
> 5) Issue qstop @clustername
>   to stop all the queues.  (Steps 5 through 10 are sketched as commands
>   below, after this list.)
>
> 6) Drain the system of running jobs, either by letting jobs run to
>     completion or by issuing qrerun for each of the running jobs.
>      (My announcement in 4) tells users which I will do, which is
>        usually qrerun.)
>
> 7) Stop all PBS daemons: pbs_server on the head node and pbs_mom on all the nodes.
>
> 8) Issue make install for the new version, installing on the head node and
>   all the nodes, then start all the torque daemons, but leave the queues stopped.
>
> 9) Start the testq (qstart testq), and submit several short test jobs
>     using a variety of node counts to satisfy myself that all is OK with
>     the cluster.
>
> 10) Issue qstart @clustername, or start queue by queue.  (I usually start the
>     shortest-walltime queue first to make sure that nothing goes wrong.)  Then I
>     start the most resource-intensive queue (usually the most nodes) down
>     to the least resource-intensive, so that these jobs get started again.
>
>    Before 10) I am ready to roll back, re-installing the old version just
> by killing the new version's daemons, changing to the old version's build
> directory, issuing make install there, starting the old version's daemons,
> and restarting all the queues.
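>
> In terms of commands, steps 5) through 10) come down to roughly this
> (clustername, queue and script names are placeholders, and I start the
> daemons through their init scripts here):
>
>     qstop @clustername                  # 5) stop all the queues
>     qstat -r                            # 6) list the running jobs ...
>     qrerun <jobid>                      #    ... and requeue each one
>     qterm -t quick                      # 7) stop the server ...
>     /etc/init.d/pbs_mom stop            #    ... and the mom on every node
>     make install                        # 8) from the already-built new source tree
>     /etc/init.d/pbs_server start
>     /etc/init.d/pbs_mom start           #    again on every node
>     qstart testq                        # 9) then a few short test jobs
>     qsub -q testq test_job.pbs
>     qstart @clustername                 # 10) or qstart one queue at a time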
>
> Good luck,
>  - Jim Coyle
>
> --
>  James Coyle, PhD
>  SGI Origin, Alpha, Xeon and Opteron Cluster Manager
>  High Performance Computing Group
>  235 Durham Center
>  Iowa State Univ.
>  Ames, Iowa 50011           web: http://jjc.public.iastate.edu
>
> Hi,
> We are using torque-2.1.2. I want to upgrade it to the latest version
> without affecting the running jobs. I feel the method in the manual is too
> simple.
> Has anyone done that? Can you give me some advice, and what should I watch out for?
> Thanks
>
> --
> Best Wishes
> ChenWeiguang
>
>   _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>


-- 
Best Wishes
ChenWeiguang

