[torqueusers] How to upgrade torque safely?
James J Coyle
jjc at iastate.edu
Thu Apr 24 08:24:23 MDT 2008
Weiguang,
I'm very cautious and want to keep users aware of a major change
in the system, which they appreciate. I follow similar procedures
for any major system software. I've never had to roll back torque, but
I have had to for other software whose new version did not work.
I do the following:
Prep work:
-------------
1) Install the new version on a two-node test system to make sure the
new version works OK. (Two old Pentium IIIs in my office.)
2) Run configure and make on the main system (in a new directory, not over
the old torque, in case I need to roll back), but hold off on make install.
(The install will be quick then.)
3) Create a testq that users don't submit to, but to which I can submit a
job to check that the new torque is up and OK.
4) Announce the upgrade with /etc/motd and/or an email to
the group(s) involved. I do this 1 day before the longest queue
walltime limit so users can decide whether they want a job to be running
when I upgrade.
(Most don't care but appreciate being kept in the loop.)
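
For illustration, steps 2) and 3) might look something like the
following. This is only a sketch under assumptions: the source path and
the admin user name are placeholders, the function names are made up for
the example, and restricting testq with the acl_user_enable/acl_users
queue attributes is one possible way to keep other users out, not
necessarily how the queue was set up here. The script prints the commands
rather than running them unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical values, not taken from the original post.
NEW_SRC="${NEW_SRC:-/usr/local/src/torque-new}"   # unpacked new source tree
TEST_ADMIN="${TEST_ADMIN:-root}"                  # only user allowed in testq

# Step 2: configure and build in the new tree, but do NOT install yet.
build_new_version() {
    run cd "$NEW_SRC" || return 1
    run ./configure || return 1
    run make
}

# Step 3: create a queue that only the admin may submit to.
# The queue is enabled (accepts jobs) but not started here.
create_testq() {
    run qmgr -c "create queue testq queue_type=execution"
    run qmgr -c "set queue testq acl_user_enable = true"
    run qmgr -c "set queue testq acl_users = $TEST_ADMIN"
    run qmgr -c "set queue testq enabled = true"
}
```

Calling build_new_version and then create_testq in dry-run mode lists
the commands so they can be reviewed before anything is touched.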
On day of upgrade:
-------------------
5) Issue qstop @clustername
to stop all the queues.
6) Drain the system of running jobs, either by letting jobs run to
completion, or by issuing qrerun for each of the running jobs.
(My notification in 4) tells users which I will do, which is
usually qrerun.)
7) Stop all pbs daemons, both on the head node and all the pbs_moms.
8) Issue make install for the new version, installing on head node and
old nodes, then start all the torque daemons, but leave the queues stopped.
9) Start the testq (qstart testq), and submit several short test jobs
using a variety of node sizes to satisfy myself that all is OK with
the cluster.
10) Issue qstart @clustername, or start queue by queue. (I usually start
the shortest-time queue first to make sure that nothing goes wrong; then I
start the queues from the most resource-intensive (usually most nodes)
down to the least resource-intensive, so that these jobs get started again.)
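
The queue handling in steps 5), 6), 9) and 10) could be sketched roughly
as below. The function names and the CLUSTER variable are invented for the
example, and the daemon stop/start and make install of steps 7) and 8)
are deliberately left out because they depend on how the nodes are
reached. As a safety measure the sketch only prints commands unless
DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

CLUSTER="${CLUSTER:-clustername}"   # server name, as in "qstop @clustername"

# Step 5: stop every queue on the server.
stop_all_queues() {
    run qstop "@$CLUSTER"
}

# Step 6 (qrerun variant): requeue each job ID read from stdin,
# one ID per line; blank lines are skipped.
rerun_jobs() {
    while read -r job; do
        if [ -n "$job" ]; then
            run qrerun "$job" </dev/null
        fi
    done
}

# Step 9: open only the test queue.
start_test_queue() {
    run qstart testq
}

# Step 10: start queues one by one, in the order given as arguments.
start_queues() {
    for q in "$@"; do
        run qstart "$q"
    done
}
```

On the day this might be driven as stop_all_queues, then
"qselect -s R | rerun_jobs" to requeue the running jobs, then (after
steps 7 and 8) start_test_queue, and finally something like
"start_queues short wide medium narrow", where those queue names are
hypothetical and stand for the shortest queue followed by the rest from
most to least resource-intensive.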
Before 10) I am ready to roll back, re-installing the old
version just by
killing the new version's daemons,
changing to the old version's directory,
issuing make install there,
starting the old version daemons and
restarting all the queues.
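
As a hedged sketch, the rollback on the head node might be scripted along
these lines. The old build directory is a placeholder, the function name
is made up, and using "qterm -t quick" and "momctl -s" to stop the new
pbs_server and the local pbs_mom is an assumption about how the daemons
get killed; the scheduler and the pbs_moms on the compute nodes would
need the equivalent treatment. Commands are only printed unless
DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical values, not taken from the original post.
OLD_SRC="${OLD_SRC:-/usr/local/src/torque-old}"   # old, already-built tree
CLUSTER="${CLUSTER:-clustername}"

rollback() {
    run qterm -t quick             # stop the new pbs_server
    run momctl -s                  # stop the pbs_mom on this host
    run cd "$OLD_SRC" || return 1  # change to the old version's directory
    run make install || return 1   # put the old binaries back
    run pbs_server                 # start the old daemons
    run pbs_mom
    run qstart "@$CLUSTER"         # restart all the queues
}
```

Running rollback in dry-run mode first shows the exact sequence, which is
handy to have checked before upgrade day.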
Good luck,
- Jim Coyle
--
James Coyle, PhD
SGI Origin, Alpha, Xeon and Opteron Cluster Manager
High Performance Computing Group
235 Durham Center
Iowa State Univ.
Ames, Iowa 50011 web: http://jjc.public.iastate.edu
> Hi,
> We are using torque-2.1.2, i want to upgrade it to the latest version and
> don't affect the running jobs. I feel the method in the manual is too
> simple.
> Who had done that? Can you give me some advices and what i must notice?
> Thanks
>
> --
> Best Wishes
> ChenWeiguang