[torquedev] [torqueusers] TORQUE 2.4 is live.

Garrick Staples garrick at usc.edu
Mon Nov 2 11:15:11 MST 2009


The tarball doesn't have a release minor; so what is the new numbering scheme?

On Mon, Nov 02, 2009 at 10:03:02AM -0700, Ken Nielson alleged:
> TORQUE version 2.4 was officially released on Thursday, October 29, 
> 2009. It will be available for download at 
> http://www.clusterresources.com/downloads/torque/torque-2.4.tar.gz
> 
> A new branch has been created in the subversion tree under 
> torque/branches/2.4-fixes. This is where bug fixes are to be made for 
> TORQUE 2.4.
> 
> Also see http://www.clusterresources.com/products/torque/docs/ for 
> updated TORQUE documentation.
> 
> Some of the feature highlights are improved high availability, job 
> arrays, improved job checkpointing, per-job epilogue and prologue 
> scripts, and service jobs. Below is the CHANGELOG for this build.
> 
> Please note that if you are running TORQUE 2.3.x and do not need any of 
> the new features provided in 2.4, do not feel obligated to upgrade. 
> 2.3.x will continue to be supported.
> 
> 
> c - crash  b - bug fix   e - enhancement   f - new feature
> 
> 2.4.2
> 
> b - Added pbs_error_db.h to src/include/Makefile.am and 
> src/include/Makefile.in. pbs_error_db.h is now needed for install.
> 
> e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file 
> will work with a comma delimited string or a list of server names 
> separated by a new line.
> 
> b - fix tracejob so it handles multiple server and mom logs for the same 
> day
> 
> f - Added a new server parameter np_default. This allows the 
> administrator to change the number of processors to a unified value 
> dynamically for the entire cluster.
> 
> e - high availability enhanced so that the server spawns a separate 
> thread to update the "lock" on the lockfile. Thread update and check 
> time are both settable parameters in qmgr.
> 
> b - close empty ACL files
> 
> 2.4.1
> 
> e - added a prologue and epilogue option to the list of resources for 
> qsub -l which allows a per job prologue or epilogue script. The syntax 
> for the new option is:
> qsub -l prologue=<prologue script>,epilogue=<epilogue script>
> 
> f - added a "-w" option to qsub to override the working directory
> 
> e - changes needed to allow relocatable checkpoint jobs. Job checkpoint 
> files
> are now under the control of the server.
> 
> c - check filename for NULL to prevent crash
> 
> b - changed so we don't try to copy a local file when the destination is 
> a directory and the file is already in that directory
> 
> f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
> 
> e - made logging functions reentrant-safe by using localtime_r() instead 
> of localtime() (merged from 2.3)
> 
> e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
> 
> e - merged in new log_ext() function to allow more fine grained syslog 
> events, you can now specify severity level. Also added more logging 
> statements
> 
> b - fixed a bug where CPU time was not being added up properly in all 
> cases. (fix for Linux only)
> 
> c - fixed a few memory errors due to some uninitialized memory being 
> allocated. (ported from 2.3 R2493)
> 
> e - added code to allow compilers to override CLONE_BATCH_SIZE at 
> configure time (allows for finer grained control on how arrays are 
> created) (ported from Yahoo R2461)
> 
> e - added code which prefixes the severity tag on all log_ext() and 
> log_err() messages (ported from Yahoo R2358)
> 
> f - added code from 2.3-extreme that allows TORQUE to handle more than 
> 1024 sockets. Also, increased the size of TORQUE's internal socket 
> handle table to avoid running out of handles under busy conditions.
> 
> e - TORQUE can now handle server names larger than 64 bytes (now set to 
> 1024, which should be larger than the max for hostnames)
> 
> e - added qmgr option accounting_keep_days, specifies how long to keep 
> accounting files.
> 
> e - changed mom config varattr so invoked script returns the varattr 
> name and value(s)
> 
> e - improved the performance of pbs_server when submitting large numbers 
> of jobs with dependencies defined
> 
> e - added new parameter "log_keep_days" to both pbs_server and pbs_mom. 
> Specifies how long to keep log files before they are automatically removed
> 
> e - added qmgr server attribute lock_file, specifies where server lock 
> file is located
> 
> b - change so we use default file name for output / error file when just 
> a directory is specified on qsub / qalter -e -o options
> 
> e - modified to allow retention of completed jobs across server shutdown
> 
> e - added job_must_report qmgr configuration which says the job must be 
> reported to scheduler. Added job attribute "reported". Added PURGECOMP 
> functionality which allows scheduler to confirm jobs are reported. Also 
> added -c option to qdel. Used to clean up unreported jobs.
> 
> b - Fix so interactive jobs run when using $job_output_file_umask 
> userdefault
> 
> f - Allow adding extra End accounting record for a running job that is 
> rerun. Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
> 
> b - Fix to use queue/server resources_defaults to validate mppnodect 
> against resources_max when mppwidth or mppnppn are not specified for job
> 
> f - merged in new dynamic array struct and functions to implement a new 
> (and more efficient) way of loading jobs at startup--should help by 2 
> orders of magnitude!
> 
> f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now 
> changed by the MOM to be smaller than the pbs_server and is also 
> configurable on the MOM ($max_conn_timeout_micro_sec)
> 
> e - change so queued jobs that get deleted go to complete and get 
> displayed in qstat based on keep_completed
> 
> b - Changes to improve the qstat -x XML output and documentation
> 
> b - Change so BATCH_PARTITION_ID does not pass through to child jobs
> 
> c - fix to prevent segfault on pbs_server -t cold
> 
> b - fix so find_resc_entry still works after setting server extra_resc
> 
> c - keep pbs_server from trying to free empty attrlist after receiving a 
> bad request (Michael Meier, University of Erlangen-Nurnberg) (merged 
> from 2.3.8)
> 
> f - new fifo scheduler config option. ignore_queue: queue_name allows 
> the scheduler to be instructed to ignore up to 16 queues on the server 
> (Simon Toth, muni.cz)
> 
> e - add administrator customizable email notifications (see manpage for 
> pbs_server_attributes) - (Roland Haas, Georgia Tech)
> 
> e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
> 
> e - created a utility module that is shared between both server and mom 
> but does NOT get placed in the libtorque library
> 
> e - allow the user to request a specific processor geometry for their 
> job using a bitmap, and then bind their jobs to those processors using 
> cpusets.
> 
> b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, 
> deCODE genetics) (merged from 2.3.8)
> 
> b - fix to prevent some jobs from getting deleted on startup.
> 
> f - add qpool.gz to contrib directory
> 
> e - improve how error constants and text messages are represented (Simon 
> Toth, muni.cz)
> 
> f - new boolean queue attribute "is_transit" that allows jobs to exceed 
> server resource limits (queue limits are respected). This allows routing 
> queues to route jobs that would be rejected for exceeding local 
> resources even when the job won't be run locally. (Simon Toth, muni.cz)
> 
> e - add support for "job_array" as a type for queue disallowed_types 
> attribute
> 
> e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
> 
> e - added pbs_mom config option igncput to ignore pcput limit enforcement
> 
> 
> 2.4.0
> 
> f - added a "-q" option to pbs_mom which does *not* perform the default 
> -p behavior
> 
> e - made "pbs_mom -p" the default option when starting pbs_mom
> 
> e - added -q to qalter to allow quicker response to modify requests
> 
> f - added basic qhold support for job arrays
> 
> b - clear out ji_destin in obit_reply
> 
> f - add qchkpt command
> 
> e - renamed job.h to pbs_job.h
> 
> b - fix logic error in checkpoint interval test
> 
> f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to 
> change the default value of the job rerunnable attribute from true to false
> 
> e - added preliminary Comprehensive System Accounting (CSA) 
> functionality for Linux. Configure option --enable-csa will cause 
> workload management records to be written if CSA is installed and wkmg 
> is turned on.
> 
> b - changes to allow post_checkpoint() to run when checkpoint is 
> completed, not when it has just started. Also corrected issue when 
> checkpoint fails while trying to put job on hold.
> 
> b - update server immediately with changed checkpoint name and time 
> attributes after successful checkpoint.
> 
> e - Changes so checkpoint jobs failing after restarted are put on hold 
> or requeued
> 
> e - Added checkpoint_restart_status job attribute used for restart status
> 
> b - Updated manpages for qsub and qterm to reflect changed checkpointing 
> options.
> 
> b - reject a qchkpt request if checkpointing is not enabled for the job
> 
> b - Mom should not send checkpoint name and time to server unless 
> checkpoint was successful
> 
> b - fix so that running jobs that have a hold type and that fail on 
> checkpoint restart get deleted when qdel is used
> 
> b - fix so we reset start_time, if needed, when restarting a 
> checkpointed job
> 
> f - added experimental fault_tolerant job attribute (set to true by 
> passing -f to qsub). This attribute indicates that a job can survive 
> the loss of a sister mom. Also added corresponding fault_tolerant and 
> fault_intolerant types to the "disallowed_types" queue attribute
> 
> b - fixes for pbs_mom's updating of comment and checkpoint name and time
> 
> e - change so we can reject hold requests on running jobs that do not 
> have checkpoint enabled if system was configured with --enable-blcr
> 
> e - change to qsub so only the host name can be specified on the -e/-o 
> options
> 
> e - added -w option to qsub that allows setting of PBS_O_WORKDIR
> 
> 
> Ken Nielson
> Adaptive Computing

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!