[torqueusers] How to upgrade torque safely?

Steve Young chemadm at hamilton.edu
Thu Apr 24 16:51:58 MDT 2008


Hi,
	I've upgraded my pbs_server and never lost a job. qterm -t quick  
will kill the server but allow the jobs to continue to run on the  
nodes. Then I restart with the new server binary; the nodes report  
back in and everything appears to continue running. Server done. For  
the nodes I do as Jerry suggests: create an admin reservation on  
each node for a time when the node will be finished with its current  
job (to prevent more jobs from starting on it). Then I restart with  
the new pbs_mom and remove the reservation to allow jobs to start  
running on the node again. So far, nodes running older versions of  
pbs_mom continue to work with the new server binary until each one  
can be restarted. Probably not the best way to do it, but it was  
worth a shot to see whether it could be done without losing anything,  
and as far as I can tell it works.
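For concreteness, the server-side sequence above might look like the sketch below; the build-directory path is an assumption, and jobs on the nodes keep running throughout:

```shell
# Stop pbs_server only; running jobs on the nodes are left alone
qterm -t quick

# Install the new server binary from a pre-built source tree
# (path is hypothetical; use wherever you ran configure/make)
cd /usr/local/src/torque-2.2.0
make install

# Start the new pbs_server; the moms report back in and jobs keep running
pbs_server
```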


-Steve

On Apr 24, 2008, at 10:46 AM, Jerry Smith wrote:

> Jim,
>
> Nice write up.  Here are a few things we do differently to maybe  
> offer some insight from another site.
>
> We build our scheduling systems in NFS space, which allows us to  
> change versions quickly via symlinks.
> Example:
> /apps is an NFS mount
> /apps/torque-2.1.0p8
> /apps/torque-2.2.0
>
> We then symlink /apps/torque to one of those versions and have all  
> of our init scripts point to /apps/torque.  If the newer version  
> fails, then we just switch the symlink back to the prior release  
> and keep on computing.
> We also symlink $PBS_HOME/mom_priv to NFS space, so if we make  
> changes to the config/prologue/epilogue they propagate immediately  
> to ALL nodes.
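The symlink scheme might look like the following sketch, using the mount point and version directories from the example above; the local $PBS_HOME path is an assumption:

```shell
# /apps is an NFS mount visible to every node.
# Point all init scripts at the generic path:
ln -s /apps/torque-2.2.0 /apps/torque

# If the new version fails, repoint the symlink at the prior release
# (-n so the symlink itself is replaced, not followed):
ln -sfn /apps/torque-2.1.0p8 /apps/torque

# mom_priv kept in NFS space, so config/prologue/epilogue edits
# reach ALL nodes at once (both paths here are assumptions):
ln -sfn /apps/torque-shared/mom_priv /var/spool/torque/mom_priv
```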
>
> A question about qrerun for you.  Do your users ever comment on  
> having their jobs requeued?  Our user community does not like that  
> option: they may have a job that has run for many days, and if I  
> qrerun it, it starts right back at the beginning, possibly  
> overwriting large amounts of data.  They would rather we kill the  
> job and let them resubmit, most times using a restart file in  
> conjunction with qsub's dependency option, i.e. run job5 only  
> after job4 finishes (using the restart files created by job4).
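In TORQUE, the dependency Jerry describes is normally expressed with qsub's -W depend option; a sketch, with hypothetical script names:

```shell
# Submit job4 and capture its job id
JOB4=$(qsub job4.pbs)

# job5 runs only after job4 completes successfully,
# picking up the restart files job4 wrote
qsub -W depend=afterok:$JOB4 job5.pbs
```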
>
> As for protecting the workload that is currently queued/idle: we  
> use Moab (the same can be done with Maui), and instead of stopping  
> the queues we create a system-wide reservation for an admin user,  
> who then runs a suite of test jobs.
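A hedged sketch of such a system-wide admin reservation; the exact flags below are assumptions from memory of Moab's mrsvctl and Maui's setres, so check your scheduler's documentation:

```shell
# Moab: create a reservation over all nodes, usable only by user "admin"
mrsvctl -c -a USER==admin -h ALL

# Maui: roughly equivalent
setres -u admin ALL
```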
>
> Jerry Smith
>
>
> James J Coyle wrote:
>>
>> Weiguang,
>>
>>    I'm very cautious and want to keep users aware of a major change
>> in the system, which they appreciate.  Similar procedures take place
>> for any major system software.  I've never had to roll back  
>> torque, but
>> I have had to for other software whose new version did not work.
>>
>>    I do the following:
>>
>> Prep work:
>> -------------
>> 1) Install the new version on a 2 node test system to make sure the
>>    new version works OK.  (Two old pentium III's in my office.)
>>
>> 2) Do the configure and make on the main system (in a new directory,
>>    not over the old torque, in case I need to roll back), but hold
>>    off on make install.
>>    (The install will then be quick.)
>>
>> 3) Create a testq that users can't submit to, but that I can submit
>>     a job to, to check that the new torque is up and OK.
>>
>> 4) Announce the upgrade with /etc/motd and/or an email to
>>     the group(s) involved. I do this 1 day before the longest queue
>>     walltime limit so users can decide whether they want a job to be
>>     running when I upgrade.
>>     (Most don't care but appreciate being kept in the loop.)
>>
>> On day of upgrade:
>> -------------------
>> 5) issue qstop @clustername
>>   to stop all the queues.
>>
>> 6) Drain the system of running jobs, either by letting jobs run to
>>     completion or by issuing qrerun for each of the running jobs.
>>     (My notification in 4) tells users which I will do, which is
>>       usually qrerun.)
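Steps 5) and 6) as commands; the awk filter that skips qstat's header lines is approximate:

```shell
# 5) stop all queues on the server
qstop '@clustername'

# 6) requeue every running job (each will later restart from the beginning)
for job in $(qstat -r | awk 'NR > 5 { print $1 }'); do
    qrerun "$job"
done
```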
>>
>> 7) Stop all pbs daemons, both on the head node and on all the
>>    pbs_mom nodes.
>>
>> 8) Issue make install for the new version, installing on the head
>>   node and all nodes, then start all the torque daemons, but leave
>>   the queues stopped.
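Steps 7) and 8) might look like the sketch below; the nodes-file path and the use of ssh to reach each node are assumptions about the setup:

```shell
# 7) stop the daemons: pbs_server on the head node, pbs_mom everywhere
qterm -t quick
for node in $(awk '{ print $1 }' /var/spool/torque/server_priv/nodes); do
    ssh "$node" momctl -s
done

# 8) install the pre-built new version, then restart with queues stopped
cd /usr/local/src/torque-2.2.0 && make install
pbs_server
for node in $(awk '{ print $1 }' /var/spool/torque/server_priv/nodes); do
    ssh "$node" pbs_mom
done
```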
>>
>> 9) Start the testq (qstart testq), and submit several short test jobs
>>     using a variety of node sizes to satisfy myself that all is OK  
>> with
>>     the cluster.
>>
>> 10) Issue qstart @clustername, or start queue by queue. I usually
>>     start the shortest-time queue first to make sure that nothing
>>     goes wrong, then start from the most resource-intensive queue
>>     (usually most nodes) down to the least resource-intensive, so
>>     that those jobs get started again.
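Steps 9) and 10) as commands; the queue names other than testq, and the test-script name, are hypothetical:

```shell
# 9) open only the test queue and exercise it at several node counts
qstart testq
qsub -q testq -l nodes=1 test.pbs
qsub -q testq -l nodes=8 test.pbs

# 10) re-open production queues, starting with the shortest-walltime one
qstart short
qstart big_parallel
# ...or everything at once:
qstart '@clustername'
```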
>>
>>    Before 10) I am ready to roll back, re-installing the old
>> version just by killing the new version's daemons, changing to the
>> old version's directory, issuing make install there, starting the
>> old version's daemons, and restarting all the queues.
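The rollback Jim describes, sketched; directory paths are assumptions:

```shell
# Kill the new daemons, reinstall the prior build, restart, reopen queues
qterm -t quick
cd /usr/local/src/torque-2.1.2   # the old version's build directory
make install
pbs_server
qstart '@clustername'
```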
>>
>> Good luck,
>>  - Jim Coyle
>>
>> --
>>  James Coyle, PhD
>>  SGI Origin, Alpha, Xeon and Opteron Cluster Manager
>>  High Performance Computing Group
>>  235 Durham CenterG
>>  Iowa State Univ.
>>  Ames, Iowa 50011           web: http://jjc.public.iastate.edu
>>
>>
>>> Hi,
>>> We are using torque-2.1.2. I want to upgrade it to the latest
>>> version without affecting the running jobs. The method in the
>>> manual seems too simple. Has anyone done this? Can you give me
>>> some advice on what I should watch out for?
>>> Thanks
>>>
>>> --
>>> Best Wishes
>>> ChenWeiguang
>>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>>
