[torqueusers] How to upgrade torque safely?

Jerry Smith jdsmit at sandia.gov
Thu Apr 24 08:46:34 MDT 2008


Jim,

Nice write-up.  Here are a few things we do differently, to offer some 
insight from another site.

We build our scheduling systems in NFS space, which allows us to change 
versions quickly via symlinks.
Example:
/apps is an NFS mount
/apps/torque-2.1.0p8
/apps/torque-2.2.0

We then symlink /apps/torque to one of those versions and have all of 
our init scripts point to /apps/torque.  If the newer version fails, 
then we just switch the symlink back to the prior release and keep on 
computing.
We also symlink $PBS_HOME/mom_priv to NFS space, so that any change to 
the config/prologue/epilogue propagates immediately to ALL nodes.
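
A sketch of that layout (paths from the example above; $PBS_HOME is 
assumed to be /var/spool/torque here, which is illustrative):

  # point the active version at the new release; init scripts never change
  ln -sfn /apps/torque-2.2.0 /apps/torque
  # rolling back is just repointing the symlink at the prior release
  ln -sfn /apps/torque-2.1.0p8 /apps/torque

  # share mom_priv over NFS so config/prologue/epilogue edits hit all nodes
  mv /var/spool/torque/mom_priv /apps/torque-mom_priv
  ln -s /apps/torque-mom_priv /var/spool/torque/mom_priv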

A question about qrerun for you.  Do your users ever comment on having 
their jobs requeued?  Our user community does not like that option: they 
may have a job that has run for many days, and if I qrerun it, it starts 
right back at the beginning, possibly overwriting large amounts of data. 
They would rather we kill the job and let them resubmit, most times 
using a restart file in conjunction with qsub's dependency option, i.e. 
run job5 only after job4 finishes (using the restart files created by 
job4).
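
For illustration, that chain can be built with torque's job-dependency 
support in qsub (script names are made up; afterok is one of several 
dependency types):

  JOB4=$(qsub job4.sh)                    # job4 writes the restart files
  qsub -W depend=afterok:$JOB4 job5.sh    # job5 runs only if job4 exits OK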

For us, to protect the workload that is currently queued/idle, we use 
Moab (the same can be done with Maui): instead of stopping the queues, 
we create a system-wide reservation for an admin user, who then runs a 
suite of test jobs.
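
A sketch of that reservation, assuming Moab's mrsvctl (Maui's setres is 
similar); the user name "admin" is illustrative:

  # Moab: reserve every node for user admin so nothing else can start
  mrsvctl -c -a USER==admin -h ALL
  # rough Maui equivalent
  setres -u admin ALL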

Jerry Smith


James J Coyle wrote:
> Weiguang,
>
>    I'm very cautious and want to keep users aware of a major change
> in the system, which they appreciate.  Similar procedures take place
> for any major system software.  I've never had to roll back torque, but
> I have had to for other software whose new version did not work.
>
>    I do the following:
>
> Prep work:
> -------------
> 1) Install the new version on a two-node test system to make sure the
>    new version works OK.  (Two old Pentium IIIs in my office.)
>
> 2) Do the configure and make on the main system (in a new directory, not
>   over the old torque, in case I need to roll back), but hold off on
>   make install.  (The install will be quick then.)
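>
>    For example (directory and prefix names are illustrative):
>
>      cd /usr/local/src/torque-2.2.0      # fresh tree; old one untouched
>      ./configure --prefix=/usr/local/torque-2.2.0
>      make                                # build now; make install waits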
>
> 3) Create a testq that users can't submit to, but that I can submit a
>     job to, to check that the system is up and OK under the new torque.
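>
>     Such a queue can be set up with qmgr; a sketch (the acl user name
>     is illustrative):
>
>       qmgr -c "create queue testq queue_type=execution"
>       qmgr -c "set queue testq acl_user_enable = true"
>       qmgr -c "set queue testq acl_users = jjc"     # only the admin submits
>       qmgr -c "set queue testq enabled = true"
>       qmgr -c "set queue testq started = false"     # qstart it on upgrade day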
>
> 4) Announce the upgrade with /etc/motd and/or an email to
>     the group(s) involved. I do this one day before the longest queue's
>     walltime limit, so users can decide whether they want a job to be
>     running when I upgrade.
>     (Most don't care, but appreciate being kept in the loop.)
>
> On day of upgrade:
> -------------------
> 5) Issue qstop @clustername to stop all the queues.
>
> 6) Drain the system of running jobs, either by letting jobs run to
>     completion or by issuing qrerun for each of the running jobs.
>      (My announcement in step 4 tells users which I will do, which is
>        usually qrerun.)
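>
>     A sketch of that drain loop, assuming bash (qstat -r lists running
>     jobs; the awk picks out lines that start with a job id):
>
>       for job in $(qstat -r | awk '/^[0-9]/ {print $1}'); do
>           qrerun "$job"      # requeued jobs wait in the stopped queues
>       done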
>
> 7) Stop all pbs daemons, both on the head node and on all the pbs_mom nodes.
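>
>     For example (a sketch; node names are illustrative, and a site might
>     use its init scripts instead):
>
>       qterm -t quick                     # stop pbs_server on the head node
>       killall pbs_sched                  # or stop Moab/Maui if that is used
>       for n in node01 node02; do
>           momctl -s -h $n                # ask each pbs_mom to shut down
>       done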
>
> 8) Issue make install for the new version, installing on the head node and
>   the compute nodes, then start all the torque daemons, but leave the
>   queues stopped.
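>
>   Roughly (a sketch; pdsh and an NFS-visible build tree are assumptions):
>
>       cd /usr/local/src/torque-2.2.0 && make install            # head node
>       pdsh -a 'cd /usr/local/src/torque-2.2.0 && make install'  # the nodes
>       pbs_server; pbs_sched              # head node daemons back up
>       pdsh -a pbs_mom                    # moms back up; queues still stopped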
>
> 9) Start the testq (qstart testq), and submit several short test jobs
>     using a variety of node sizes to satisfy myself that all is OK with
>     the cluster.
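>
>     e.g. (node counts and walltimes are illustrative):
>
>       qstart testq
>       echo hostname | qsub -q testq -l nodes=2,walltime=00:05:00
>       echo hostname | qsub -q testq -l nodes=8,walltime=00:05:00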
>
> 10) Issue qstart @clustername, or start queue by queue. (I usually start
>     the shortest-walltime queue first to make sure that nothing goes wrong,
>     then start queues from the most resource-intensive (usually most nodes)
>     down to the least resource-intensive, so that those jobs get started
>     again.)
>
>    Before 10) I am ready to roll back, re-installing the old version
> just by killing the new version's daemons, changing to the old version's
> directory, issuing make install there, starting the old version's
> daemons, and restarting all the queues.
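>
>    As a sketch (directories are illustrative; pbs_mom restarts happen on
>    the nodes as in step 8):
>
>      qterm -t quick; killall pbs_sched   # kill the new version's daemons
>      cd /usr/local/src/torque-2.1.2      # the old version's build tree
>      make install
>      pbs_server; pbs_sched               # old daemons back up
>      qstart @clustername                 # restart all the queues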
>
> Good luck,
>  - Jim Coyle
>
> --
>  James Coyle, PhD
>  SGI Origin, Alpha, Xeon and Opteron Cluster Manager
>  High Performance Computing Group
>  235 Durham Center
>  Iowa State Univ.
>  Ames, Iowa 50011           web: http://jjc.public.iastate.edu
>
>   
>> Hi,
>> We are using torque-2.1.2. I want to upgrade it to the latest version
>> without affecting the running jobs. I feel the method in the manual is
>> too simple.
>> Has anyone done that? Can you give me some advice, and what must I
>> watch out for?
>> Thanks
>>
>> --
>> Best Wishes
>> ChenWeiguang