[torqueusers] How to upgrade torque safely?
Weiguang Chen
chenweiguang82 at gmail.com
Thu Apr 24 23:56:44 MDT 2008
Thank you all very much, I will try it.
On Fri, Apr 25, 2008 at 6:51 AM, Steve Young <chemadm at hamilton.edu> wrote:
> Hi, I've upgraded my pbs_server and never lost a job. qterm -t quick will
> kill the server but allow the jobs to continue to run on the nodes. Then I
> restart with the new server binary and the nodes report back in and
> everything appears to continue running. Server done. Now for the nodes I do
> as Jerry suggests. Create an admin reservation on that node at a time when
> the node will be finished with its current job (to prevent more jobs from
> running on it). Then I restart with the new pbs_mom and remove the
> reservation to allow jobs to start running on the node again. So far nodes
> running older versions of pbs_mom continue to work with the new server
> binary until each one can be restarted. Probably not the best way to do it
> but it was worth a shot to see if it could be done without losing anything.
> As far as I can tell it seems to work.
>
>
> -Steve
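
The server-side part of Steve's procedure could be sketched roughly as below. This is only a dry-run illustration: the `run` helper, the `upgrade_server` function and the `NEW_SERVER` path are assumptions made for the example and do not come from the thread. `pbsnodes -a` is used here as one way to check that the nodes have reported back in.

```shell
#!/bin/sh
# Dry-run sketch of the server restart described above.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
set -eu

DRY_RUN="${DRY_RUN:-1}"
# Assumed location of the newly built server binary; adjust for your site.
NEW_SERVER="${NEW_SERVER:-/usr/local/sbin/pbs_server}"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_server() {
    run qterm -t quick    # stop pbs_server, leave running jobs on the nodes
    run "$NEW_SERVER"     # start the new server binary
    run pbsnodes -a       # list node state to see the nodes report back in
}

upgrade_server
```

Set `DRY_RUN=0` only after checking the printed commands against your own installation.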
>
> On Apr 24, 2008, at 10:46 AM, Jerry Smith wrote:
>
> Jim,
>
> Nice write up. Here are a few things we do differently to maybe offer some
> insight from another site.
>
> We build our scheduling systems in NFS space, which allows us to change
> versions quickly via symlinks.
> Example:
> /apps is an NFS mount
> /apps/torque-2.1.0p8
> /apps/torque-2.2.0
>
> We then symlink /apps/torque to one of those versions and have all of our
> init scripts point to /apps/torque. If the newer version fails, then we
> just switch the symlink back to the prior release and keep on computing.
> We also symlink $PBS_HOME/mom_priv to NFS space so if we make changes to
> the config/prologue/epilogue it propagates immediately to ALL nodes.
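
A minimal sketch of the symlink switch Jerry describes. The `switch_torque` helper and the `APPS_ROOT` variable are hypothetical names for this example. `APPS_ROOT` defaults to a scratch directory so the example can be tried without touching a real /apps mount, and the two version directories are empty stand-ins.

```shell
#!/bin/sh
# Sketch: switch the active torque version by repointing one symlink.
set -eu

# Defaults to a throwaway directory; set APPS_ROOT=/apps on a real system.
APPS_ROOT="${APPS_ROOT:-$(mktemp -d)}"

# Stand-ins for the two installed versions named in the post.
mkdir -p "$APPS_ROOT/torque-2.1.0p8" "$APPS_ROOT/torque-2.2.0"

# switch_torque VERSION: point $APPS_ROOT/torque at torque-VERSION.
switch_torque() {
    target="$APPS_ROOT/torque-$1"
    [ -d "$target" ] || { echo "no such version: $1" >&2; return 1; }
    # -n replaces the link itself instead of descending into the old target
    ln -sfn "$target" "$APPS_ROOT/torque"
}

switch_torque 2.2.0       # upgrade
readlink "$APPS_ROOT/torque"
switch_torque 2.1.0p8     # roll back
readlink "$APPS_ROOT/torque"
```

Because the init scripts only ever reference the `torque` link, rolling back means repointing the link and restarting the daemons.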
>
> A question about qrerun for you. Do your users ever comment on having
> their jobs requeued? Our user community does not like that option, as they
> may have a job that has run many days, and if I qrerun it, it starts right
> back at the beginning, overwriting possibly large amounts of data. They
> would rather we kill the job and let them resubmit, most often using a
> restart file together with the qsub dependency option, i.e. run job5
> only after job4 finishes (using the restart files created by job4).
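
The job4/job5 chaining mentioned above might look like the following sketch, which uses qsub's `-W depend=afterok:<jobid>` attribute. The `submit` helper, the script names `job4.sh` and `job5.sh`, and the placeholder job ids printed in dry-run mode are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: submit job5 so that it starts only after job4 finishes successfully.
set -eu

DRY_RUN="${DRY_RUN:-1}"

# submit SCRIPT [JOBID]: submit SCRIPT, optionally dependent on JOBID.
# Prints the new job id on stdout (a placeholder id in dry-run mode).
submit() {
    script=$1
    dep=${2:-}
    if [ -n "$dep" ]; then
        set -- qsub -W "depend=afterok:$dep" "$script"
    else
        set -- qsub "$script"
    fi
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*" >&2            # show the command that would run
        echo "dryrun.$script"      # placeholder job id
    else
        "$@"                       # qsub prints the real job id
    fi
}

job4=$(submit job4.sh)
job5=$(submit job5.sh "$job4")
echo "job5 ($job5) waits for job4 ($job4)"
```

job5.sh itself would be responsible for picking up the restart files that job4 wrote.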
>
> To protect the workload that is currently queued/idle, we use Moab (the
> same can be done with Maui): instead of stopping the queues, we create a
> system-wide reservation for an admin user, who then runs a suite of test
> jobs.
>
> Jerry Smith
>
>
> James J Coyle wrote:
>
> Weiguang,
>
> I'm very cautious and want to keep users aware of a major change
> in the system, which they appreciate. Similar procedures take place
> for any major system software. I've never had to roll back torque, but
> I have had to for other software whose new version did not work.
>
> I do the following:
>
> Prep work:
> -------------
> 1) Install the new version on a 2 node test system to make sure the
> new version works OK. (Two old pentium III's in my office.)
>
> 2) Do the configure and make (in a new directory not over the old torque
> in case I need to roll back) on the main system but wait with make install.
> (Install will be quick then.)
>
> 3) Create a testq that users don't submit to, but to which I can submit a
> job to check that the new torque is up and OK.
>
> 4) Announce the upgrade with /etc/motd and/or an email to
> the group(s) involved. I do this 1 day before the longest queue
> walltime limit so users can decide if they want job to be running
> when I upgrade.
> (Most don't care but appreciate being kept in the loop.)
>
> On day of upgrade:
> -------------------
> 5) issue qstop @clustername
> to stop all the queues.
>
> 6) drain the system of running jobs, either by letting jobs run to
> completion, or by issuing qrerun for each of the running jobs.
> (My notification in 4) tells users which I will do, which is
> usually qrerun.)
>
> 7) Stop all pbs daemons, both on the head node and all the pbs_moms.
>
> 8) Issue make install for the new version, installing on head node and
> old nodes, then start all the torque daemons, but leave the queues stopped.
>
> 9) Start the testq (qstart testq), and submit several short test jobs
> using a variety of node sizes to satisfy myself that all is OK with
> the cluster.
>
> 10) issue qstart @clustername, or start queue by queue. (I usually start the
> shortest time queue first to make sure that nothing goes wrong; then I
> start the queues from most resource intensive (usually most nodes) down
> to least resource intensive so that these jobs get started again.)
>
> Before 10) I am ready to roll back, re-installing the old
> version just by
> killing the new versions,
> changing to the old version's directory,
> issuing make install there,
> starting the old version daemons and
> restarting all the queues.
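
Jim's upgrade-day steps 5) to 10) could be sketched as the dry-run script below. The function names, the `SERVER` and `BUILD_DIR` variables, and the build path are hypothetical, and the use of `qselect -s R` to list the running jobs for qrerun is an assumption about how to implement step 6. Stopping and starting pbs_mom on the compute nodes is site specific and is left as a comment.

```shell
#!/bin/sh
# Dry-run sketch of the upgrade-day sequence (steps 5 to 10 above).
set -eu

DRY_RUN="${DRY_RUN:-1}"
SERVER="${SERVER:-clustername}"                      # placeholder from the post
BUILD_DIR="${BUILD_DIR:-/usr/local/src/torque-new}"  # hypothetical build dir

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Step 6: requeue every running job (qselect -s R prints running job ids).
drain_by_rerun() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ qrerun <each job id from: qselect -s R>"
    else
        qselect -s R | while read -r job; do qrerun "$job"; done
    fi
}

upgrade_day() {
    run qstop "@$SERVER"                # 5) stop all queues
    drain_by_rerun                      # 6) drain running jobs
    run qterm -t quick                  # 7) stop pbs_server (stop the
                                        #    pbs_moms with your own tooling)
    run make -C "$BUILD_DIR" install    # 8) install the new version
    run pbs_server                      #    restart; queues stay stopped
    run qstart testq                    # 9) open only the test queue
}

# Step 10: reopen everything once the test jobs look good.
open_queues() {
    run qstart "@$SERVER"
}

upgrade_day
```

Rolling back before step 10 would follow the same shape with `BUILD_DIR` pointing at the old version's directory.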
>
> Good luck,
> - Jim Coyle
>
> --
> James Coyle, PhD
> SGI Origin, Alpha, Xeon and Opteron Cluster Manager
> High Performance Computing Group
> 235 Durham CenterG
> Iowa State Univ.
> Ames, Iowa 50011 web: http://jjc.public.iastate.edu
>
> Hi,
> We are using torque-2.1.2. I want to upgrade it to the latest version
> without affecting the running jobs. I feel the method in the manual is too
> simple.
> Has anyone done this? Can you give me some advice on what I should watch for?
> Thanks
>
> --
> Best Wishes
> ChenWeiguang
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
--
Best Wishes
ChenWeiguang