[torqueusers] How to upgrade torque safely?
Steve Young
chemadm at hamilton.edu
Thu Apr 24 16:51:58 MDT 2008
Hi,
I've upgraded my pbs_server this way and never lost a job. "qterm -t
quick" shuts down the server but lets the jobs keep running on the
nodes. I then restart with the new server binary; the nodes report
back in and everything appears to continue running. That takes care
of the server. For the nodes I do as Jerry suggests: create an admin
reservation on each node, starting when the node will have finished
its current job (to prevent more jobs from running on it). Then I
restart the node with the new pbs_mom and remove the reservation so
jobs can start running on it again. So far, nodes running older
versions of pbs_mom have continued to work with the new server binary
until each one could be restarted. It's probably not the best way to
do it, but it was worth a shot to see if it could be done without
losing anything. As far as I can tell, it works.
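
As a rough sketch, the two phases above might be scripted as follows.
This is an illustration under assumptions rather than the commands
actually used: the run helper, the DRY_RUN variable and the function
names are made up for the example, and pbsnodes -o / -c (mark a node
offline, then clear the flag) is swapped in for the scheduler-side
admin reservation, whose syntax depends on whether Maui or Moab is in
use. The pbs_mom restart itself is site-specific and left as a comment.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to
# actually execute them on a TORQUE host.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_server() {
    # Shut down pbs_server but leave running jobs alone on the nodes.
    run qterm -t quick
    # Start the new binary (assumed here to be the first one in PATH).
    run pbs_server
}

upgrade_node() {
    node=$1
    # Assumption: offline flag used in place of an admin reservation,
    # so that no new jobs are started on this node.
    run pbsnodes -o "$node"
    # ... wait for the node's current job to finish, then restart
    # pbs_mom on the node with the new binary (site-specific) ...
    # Clear the offline flag so jobs can start on the node again.
    run pbsnodes -c "$node"
}
```

With the default DRY_RUN=1, calling upgrade_server or
"upgrade_node node01" (node01 being a placeholder name) only echoes
the commands, which makes it easy to review the sequence first.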
-Steve
On Apr 24, 2008, at 10:46 AM, Jerry Smith wrote:
> Jim,
>
> Nice write-up. Here are a few things we do differently, to offer
> some insight from another site.
>
> We build our scheduling systems in NFS space, which allows us to
> change versions quickly via symlinks.
> Example:
> /apps is an NFS mount
> /apps/torque-2.1.0p8
> /apps/torque-2.2.0
>
> We then symlink /apps/torque to one of those versions and have all
> of our init scripts point to /apps/torque. If the newer version
> fails, then we just switch the symlink back to the prior release
> and keep on computing.
> We also symlink $PBS_HOME/mom_priv to NFS space so if we make
> changes to the config/prologue/epilogue it propagates immediately
> to ALL nodes.
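
A minimal sketch of the symlink switch Jerry describes, run in a
scratch directory so it can be tried safely. The APPS variable and the
use_version function are names invented for this example; on a real
cluster APPS would be the NFS-mounted /apps and the version directories
would already hold the builds.

```shell
#!/bin/sh
# Scratch stand-in for /apps unless APPS is already set.
APPS=${APPS:-$(mktemp -d)}
mkdir -p "$APPS/torque-2.1.0p8" "$APPS/torque-2.2.0"

use_version() {
    # -n replaces the existing link itself instead of creating a new
    # link inside the directory it currently points to.
    ln -sfn "$APPS/torque-$1" "$APPS/torque"
}

use_version 2.2.0        # switch to the new release
readlink "$APPS/torque"

use_version 2.1.0p8      # switch back if the new release fails
readlink "$APPS/torque"
```

Because the init scripts only ever reference the /apps/torque link,
neither switch requires touching them.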
>
> A question about qrerun for you. Do your users ever comment on
> having their jobs requeued? Our user community does not like that
> option, as they may have a job that has run many days, and if I
> qrerun it, it starts right back at the beginning, overwriting
> possibly large amounts of data. They would rather we kill the job
> and let them resubmit, most often using a restart file in
> conjunction with the "depend" qsub option, i.e. run job5 only
> after job4 finishes (using the restart files created by job4).
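
The job4/job5 chaining mentioned above could look roughly like this
with qsub's -W depend attribute. The submit_chain function, the QSUB
variable and the script names are hypothetical; afterok is one choice
of dependency type (start only if the first job exits successfully),
and afterany would start the second job whatever the first job's exit
status.

```shell
#!/bin/sh
# QSUB can be overridden, e.g. to point at a wrapper.
QSUB=${QSUB:-qsub}

submit_chain() {
    first=$1
    second=$2
    # qsub prints the identifier of the job it just queued.
    first_id=$("$QSUB" "$first") || return 1
    # Hold the second job until the first one has finished OK.
    "$QSUB" -W depend=afterok:"$first_id" "$second"
}

# Usage on a submit host (job4.sh and job5.sh are placeholders):
#   submit_chain job4.sh job5.sh
```

The second script is then responsible for picking up the restart
files that the first one wrote.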
>
> To protect the workload that is currently queued/idle, instead of
> stopping the queues we create a system-wide reservation for an
> admin user, who then runs a suite of test jobs. We use Moab, but
> the same can be done with Maui.
>
> Jerry Smith
>
>
> James J Coyle wrote:
>>
>> Weiguang,
>>
>> I'm very cautious and want to keep users aware of a major change
>> in the system, which they appreciate. Similar procedures take place
>> for any major system software. I've never had to roll back torque,
>> but I have had to for other software whose new version did not work.
>>
>> I do the following:
>>
>> Prep work:
>> -------------
>> 1) Install the new version on a 2 node test system to make sure the
>> new version works OK. (Two old pentium III's in my office.)
>>
>> 2) Do the configure and make on the main system (in a new directory,
>> not over the old torque, in case I need to roll back) but hold off
>> on make install. (The install will then be quick.)
>>
>> 3) Create a testq that users don't submit to, but to which I can
>> submit a job to check that the new torque is up and OK.
>>
>> 4) Announce the upgrade with /etc/motd and/or an email to
>> the group(s) involved. I do this 1 day before the longest queue
>> walltime limit so users can decide if they want a job to be running
>> when I upgrade.
>> (Most don't care but appreciate being kept in the loop.)
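
One way the testq from step 3 could be created with qmgr is sketched
below. The original does not say how users are kept out of the queue;
the user ACL shown here is an assumption, as are the create_testq
function, the admin user argument and the run/DRY_RUN helper.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_testq() {
    admin=$1
    run qmgr -c "create queue testq queue_type=execution"
    # Assumption: restrict submission to the admin user via an ACL.
    run qmgr -c "set queue testq acl_user_enable=true"
    run qmgr -c "set queue testq acl_users=$admin"
    # Accept jobs now; the queue is only started later (qstart testq).
    run qmgr -c "set queue testq enabled=true"
}
```

In dry-run mode the echoed lines lose the quotes around the qmgr
directive, so check the function body rather than copying its output.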
>>
>> On day of upgrade:
>> -------------------
>> 5) issue qstop @clustername
>> to stop all the queues.
>>
>> 6) Drain the system of running jobs, either by letting the jobs
>> run to completion or by issuing qrerun for each of the running
>> jobs. (My notification in 4) tells users which I will do, which is
>> usually qrerun.)
>>
>> 7) Stop all pbs daemons, both on the head node and all the pbs_moms.
>>
>> 8) Issue make install for the new version, installing on head node
>> and old nodes, then start all the torque daemons, but leave the
>> queues stopped.
>>
>> 9) Start the testq (qstart testq), and submit several short test jobs
>> using a variety of node sizes to satisfy myself that all is OK with
>> the cluster.
>>
>> 10) Issue qstart @clustername, or start queue by queue. (I usually
>> start the shortest-time queue first to make sure that nothing goes
>> wrong, then start the rest from the most resource-intensive queue
>> (usually the one with the most nodes) down to the least, so that
>> these jobs get started again.)
>>
>> Before 10) I am ready to roll back, re-installing the old version
>> just by
>> killing the new version's daemons,
>> changing to the old version's directory,
>> issuing make install there,
>> starting the old version daemons and
>> restarting all the queues.
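
Steps 5, 6, 9 and 10 of Jim's procedure could be wrapped roughly as
below. This is a sketch under assumptions: the function names, the
run/DRY_RUN helper and the test job script are invented for the
example, and feeding job ids from "qselect -s R" is just one way to
list the running jobs. Steps 7 and 8 (stopping daemons, make install,
restarting daemons) are site-specific and not shown.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 5-6: stop every queue on the server, then requeue the running
# jobs whose ids arrive on stdin, one per line.
stop_and_drain() {
    run qstop "@$1"
    while read -r job; do
        if [ -n "$job" ]; then
            run qrerun "$job"
        fi
    done
}

# Step 9: start only the test queue and submit one test job to it.
smoke_test() {
    run qstart testq
    run qsub -q testq "$1"
}

# Step 10: restart all queues once the test jobs look good.
reopen_all() {
    run qstart "@$1"
}

# Possible usage, with clustername and test_job.sh as placeholders:
#   qselect -s R | stop_and_drain clustername
#   (stop daemons, make install, start daemons)
#   smoke_test test_job.sh
#   reopen_all clustername
```

Keeping reopen_all as a separate, final call mirrors the point made
above: until the queues are restarted, rolling back is still cheap.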
>>
>> Good luck,
>> - Jim Coyle
>>
>> --
>> James Coyle, PhD
>> SGI Origin, Alpha, Xeon and Opteron Cluster Manager
>> High Performance Computing Group
>> 235 Durham Center
>> Iowa State Univ.
>> Ames, Iowa 50011 web: http://jjc.public.iastate.edu
>>
>>
>>> Hi,
>>> We are using torque-2.1.2. I want to upgrade it to the latest
>>> version without affecting the running jobs. The method in the
>>> manual seems too simple to me.
>>> Has anyone done this? Can you give me some advice on what I need
>>> to watch out for?
>>> Thanks
>>>
>>> --
>>> Best Wishes
>>> ChenWeiguang
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers