[torqueusers] How to upgrade torque safely?

James J Coyle jjc at iastate.edu
Thu Apr 24 08:24:23 MDT 2008


Weiguang,

   I'm very cautious and like to keep users informed of any major change
to the system, which they appreciate.  I follow a similar procedure
for any major system software.  I've never had to roll back torque, but
I have had to roll back other software whose new version did not work.

   I do the following:

Prep work:
-------------
1) Install the new version on a two-node test system to make sure the 
   new version works OK.  (Two old Pentium IIIs in my office.)

2) Run configure and make on the main system, in a new directory rather 
  than over the old torque build (in case I need to roll back), but hold 
  off on make install.  (The install will then be quick.)

3) Create a testq that users can't submit to, but that I can submit a job 
    to in order to check that the new torque is up and OK.

4) Announce the upgrade with /etc/motd and/or an email to 
    the group(s) involved. I send this out ahead of the upgrade by the 
    longest queue walltime limit plus 1 day, so users can decide if they 
    want a job to be running when I upgrade.
    (Most don't care but appreciate being kept in the loop.)
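
   A minimal sketch of steps 3) and 4) in Python, for illustration only.
It reads step 4) as: announce no later than the longest walltime limit plus
one day before the upgrade.  The queue names, walltime limits, upgrade date
and admin user name are made up.  The qmgr attributes used (acl_user_enable,
acl_users, enabled, started) are one way to get a queue that only the admin
can submit to, not necessarily how the testq above was set up.  The script
only prints the commands; it does not run them.

```python
import shlex
from datetime import datetime, timedelta


def announce_by(upgrade_at, queue_walltimes, notice=timedelta(days=1)):
    """Latest announcement time: longest walltime limit plus `notice`
    before the upgrade, so a user can still choose to avoid the window."""
    longest = max(queue_walltimes.values())
    return upgrade_at - longest - notice


def testq_commands(queue="testq", admin="root"):
    """qmgr commands for a queue restricted to `admin` and left stopped."""
    settings = [
        f"create queue {queue}",
        f"set queue {queue} queue_type = Execution",
        f"set queue {queue} acl_user_enable = True",
        f"set queue {queue} acl_users = {admin}",
        f"set queue {queue} enabled = True",
        f"set queue {queue} started = False",
    ]
    return [["qmgr", "-c", s] for s in settings]


# Hypothetical site data: two queues and a planned upgrade time.
walltimes = {"short": timedelta(hours=4), "long": timedelta(days=7)}
upgrade_at = datetime(2008, 5, 6, 9, 0)

print("announce by:", announce_by(upgrade_at, walltimes))
for cmd in testq_commands():
    print(shlex.join(cmd))  # print only; nothing is executed
```

With a 7-day longest walltime, the announcement in this example would go out
8 days before the upgrade.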

On day of upgrade:
-------------------
5) Issue qstop @clustername
  to stop all the queues.

6) Drain the system of running jobs, either by letting the jobs run to 
    completion or by issuing qrerun for each of the running jobs.
     (My notification in 4) tells users which I will do, which is 
       usually qrerun.)

7) Stop all pbs daemons, both on the head node and on all the pbs_mom nodes.

8) Issue make install for the new version, installing on the head node and 
  all nodes, then start all the torque daemons, but leave the queues stopped.

9) Start the testq (qstart testq), and submit several short test jobs 
    using a variety of node sizes to satisfy myself that all is OK with 
    the cluster. 

10) Issue qstart @clustername, or start queue by queue. (I usually start the 
    shortest-time queue first to make sure that nothing goes wrong; then I 
    start the queues from the most resource intensive (usually most nodes) 
    down to the least resource intensive so that these jobs get started again.)
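
   Steps 5) through 10) can be sketched as a printed plan in Python.  This
is an illustration under assumptions, not a script to run as is: the server
name, job ids, queue data and build directory are hypothetical, the list of
running jobs would have to come from something like qselect -s R, and the
daemon stop/start steps are left as comments because they are site-specific.

```python
import shlex


def start_order(queues):
    """Shortest-walltime queue first, then the rest from most nodes to fewest."""
    if not queues:
        return []
    first = min(queues, key=lambda q: q["walltime_h"])
    rest = sorted((q for q in queues if q is not first),
                  key=lambda q: q["nodes"], reverse=True)
    return [first["name"]] + [q["name"] for q in rest]


def upgrade_day_plan(server, running_jobs, queues, build_dir, test_queue="testq"):
    """Return the day-of-upgrade steps as shell lines (comments start with #)."""
    plan = [f"qstop @{server}"]                                   # step 5
    plan += [f"qrerun {job}" for job in running_jobs]             # step 6
    plan.append("# stop pbs_server and every pbs_mom (site-specific)")       # step 7
    plan.append(f"make -C {shlex.quote(build_dir)} install")      # step 8
    plan.append("# install on the nodes, start all daemons (site-specific)")
    plan.append(f"qstart {test_queue}")                           # step 9
    plan.append(f"# submit short test jobs to {test_queue} at several node counts")
    plan += [f"qstart {name}@{server}" for name in start_order(queues)]  # step 10
    return plan


# Hypothetical cluster data.
queues = [
    {"name": "short", "walltime_h": 1, "nodes": 4},
    {"name": "big", "walltime_h": 48, "nodes": 64},
    {"name": "medium", "walltime_h": 24, "nodes": 16},
]
for line in upgrade_day_plan("head", ["101.head", "102.head"],
                             queues, "/usr/local/src/torque-new"):
    print(line)  # print only; nothing is executed
```

Printing the plan first makes it easy to review the queue start order
before typing any of it on the real system.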

   Before 10) I am ready to roll back, re-installing the old version just by
killing the new version's daemons, changing to the old version's directory,
issuing make install there, starting the old version's daemons, and
restarting all the queues.
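
   The rollback can be written down ahead of time in the same spirit.  A
small Python sketch, assuming a hypothetical server name and old build
directory; killing and starting the daemons is left as comments since how
that is done depends on the site.

```python
import shlex


def rollback_plan(server, old_build_dir):
    """Return the rollback steps as shell lines (comments start with #)."""
    return [
        "# kill the new version's daemons on the head node and all nodes",
        f"make -C {shlex.quote(old_build_dir)} install",
        "# start the old version's daemons on the head node and all nodes",
        f"qstart @{server}",
    ]


# Hypothetical names; print only, nothing is executed.
for line in rollback_plan("head", "/usr/local/src/torque-old"):
    print(line)
```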

Good luck,
 - Jim Coyle

-- 
 James Coyle, PhD
 SGI Origin, Alpha, Xeon and Opteron Cluster Manager
 High Performance Computing Group     
 235 Durham Center            
 Iowa State Univ.         
 Ames, Iowa 50011           web: http://jjc.public.iastate.edu

> Hi,
> We are using torque-2.1.2. I want to upgrade it to the latest version
> without affecting the running jobs. I feel the method in the manual is too
> simple.
> Has anyone done this? Can you give me some advice on what I should watch for?
> Thanks
> 
> -- 
> Best Wishes
> ChenWeiguang



