[torquedev] [torqueusers] TORQUE 2.4 is live.
Garrick Staples
garrick at usc.edu
Mon Nov 2 11:15:11 MST 2009
The tarball name doesn't include a minor release number, so what is the new numbering scheme?
On Mon, Nov 02, 2009 at 10:03:02AM -0700, Ken Nielson alleged:
> TORQUE version 2.4 was officially released on Thursday, October 29,
> 2009. It will be available for download at
> http://www.clusterresources.com/downloads/torque/torque-2.4.tar.gz
>
> A new branch has been created in the subversion tree under
> torque/branches/2.4-fixes. This is where bug fixes are to be made for
> TORQUE 2.4.
>
> Also see http://www.clusterresources.com/products/torque/docs/ for
> updated TORQUE documentation.
>
> Some of the feature highlights are improved high availability, job
> arrays, improved job check pointing, per job epilogue and prologue
> scripts, and service jobs. Below is the CHANGELOG for this build.
>
> Please note if you are running TORQUE 2.3.x and do not need any of the
> new features provided in 2.4 do not feel obligated to upgrade. 2.3.x
> will continue to be supported.
>
>
> c - crash b - bug fix e - enhancement f - new feature
>
> 2.4.2
>
> b - Added pbs_error_db.h to src/include/Makefile.am and
> src/include/Makefile.in.
> pbs_error_db.h now needed for install.
>
> e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file
> will work with a comma-delimited string or a list of server names
> separated by newlines.
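A minimal Python sketch of the two server_name layouts this entry accepts, assuming that blank entries and surrounding whitespace are ignored. parse_server_list is a hypothetical helper for illustration, not the pbs_get_server_list C code.

```python
from typing import List


def parse_server_list(text: str) -> List[str]:
    """Split server_name file contents on newlines and commas (hypothetical helper)."""
    servers: List[str] = []
    for line in text.splitlines():
        for name in line.split(","):
            name = name.strip()  # tolerate "head1, head2" and trailing blanks
            if name:
                servers.append(name)
    return servers


print(parse_server_list("head1,head2\n"))   # ['head1', 'head2']
print(parse_server_list("head1\nhead2\n"))  # ['head1', 'head2']
```

Both layouts produce the same ordered list, so a client can try each server in turn regardless of how the file was written.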
>
> b - fix tracejob so it handles multiple server and mom logs for the same
> day
>
> f - Added a new server parameter, np_default. This allows the
> administrator to dynamically set the number of processors to a single
> value for the entire cluster.
>
> e - High availability enhanced so that the server spawns a separate
> thread to update the "lock" on the lockfile. The thread's update and
> check times are both settable parameters in qmgr.
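The idea behind the lock-update thread can be sketched in Python: one thread periodically refreshes the lock file's modification time, and a standby server treats the lock as abandoned if it has not been refreshed within a check interval. This is an illustration under assumptions only; the function names, the use of mtime as the heartbeat, and the interval values are hypothetical and do not reflect the actual qmgr parameter names or pbs_server internals.

```python
import os
import tempfile
import threading
import time
from typing import Optional


def start_lock_updater(path: str, update_secs: float, stop: threading.Event) -> threading.Thread:
    """Refresh the lock file's mtime every update_secs until stop is set (hypothetical)."""
    def run() -> None:
        while not stop.is_set():
            with open(path, "a"):  # create the file if it does not exist yet
                pass
            os.utime(path, None)  # set access/modification time to "now"
            stop.wait(update_secs)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def lock_is_stale(path: str, check_secs: float, now: Optional[float] = None) -> bool:
    """True if the lock has not been refreshed within check_secs (or does not exist)."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > check_secs
    except FileNotFoundError:
        return True


stop = threading.Event()
lock_path = os.path.join(tempfile.mkdtemp(), "server.lock")
updater = start_lock_updater(lock_path, 0.05, stop)
time.sleep(0.2)
print(lock_is_stale(lock_path, check_secs=5.0))  # False: the thread keeps it fresh
stop.set()
updater.join()
```

Running the refresh in its own thread means a long-running request in the main server loop cannot delay the heartbeat long enough for a standby to take over by mistake.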
>
> b - close empty ACL files
>
> 2.4.1
>
> e - added a prologue and epilogue option to the list of resources for
> qsub -l, which allows a per job prologue or epilogue script. The syntax
> for the new option is
> qsub -l prologue=<prologue script>,epilogue=<epilogue script>
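To illustrate how the comma-separated name=value form above breaks down, here is a simplified Python sketch. parse_resource_list is hypothetical, the script paths are placeholders, and real qsub parsing also handles node specifications and other cases this ignores.

```python
from typing import Dict, Optional


def parse_resource_list(spec: str) -> Dict[str, Optional[str]]:
    """Split a 'name=value,name=value' string into a dict (simplified; no quoting)."""
    resources: Dict[str, Optional[str]] = {}
    for item in spec.split(","):
        name, sep, value = item.partition("=")
        resources[name.strip()] = value.strip() if sep else None
    return resources


res = parse_resource_list("prologue=/home/user/pre.sh,epilogue=/home/user/post.sh")
print(res["prologue"])  # /home/user/pre.sh
print(res["epilogue"])  # /home/user/post.sh
```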
>
> f - added a "-w" option to qsub to override the working directory
>
> e - changes needed to allow relocatable checkpoint jobs. Job checkpoint
> files are now under the control of the server.
>
> c - check filename for NULL to prevent crash
>
> b - changed so we don't try to copy a local file when the destination is
> a directory and the file is already in that directory
>
> f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
>
> e - made logging functions reentrant-safe by using localtime_r() instead
> of localtime() (merged from 2.3)
>
> e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
>
> e - merged in new log_ext() function to allow more fine-grained syslog
> events; you can now specify the severity level. Also added more logging
> statements
>
> b - fixed a bug where CPU time was not being added up properly in all
> cases. (fix for Linux only)
>
> c - fixed a few memory errors due to some uninitialized memory being
> allocated. (ported from 2.3 R2493)
>
> e - added code to allow CLONE_BATCH_SIZE to be overridden at configure
> time (allows finer-grained control over how arrays are created) (ported
> from Yahoo R2461)
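One way to picture a clone batch size is to create array sub-jobs in fixed-size chunks instead of all at once. The Python sketch below assumes that interpretation; clone_in_batches is a hypothetical name and does not reproduce how pbs_server actually schedules cloning work.

```python
from typing import Iterator, List


def clone_in_batches(array_size: int, batch_size: int) -> Iterator[List[int]]:
    """Yield lists of sub-job indices, at most batch_size per batch (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, array_size, batch_size):
        yield list(range(start, min(start + batch_size, array_size)))


print(list(clone_in_batches(10, 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A smaller batch returns control to the server more often while a large array is being created; a larger batch finishes cloning in fewer passes.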
>
> e - added code which prefixes the severity tag on all log_ext() and
> log_err() messages (ported from Yahoo R2358)
>
> f - added code from 2.3-extreme that allows TORQUE to handle more than
> 1024 sockets. Also, increased the size of TORQUE's internal socket
> handle table to avoid running out of handles under busy conditions.
>
> e - TORQUE can now handle server names larger than 64 bytes (now set to
> 1024, which should exceed the maximum hostname length)
>
> e - added qmgr option accounting_keep_days, specifies how long to keep
> accounting files.
>
> e - changed mom config varattr so invoked script returns the varattr
> name and value(s)
>
> e - improved the performance of pbs_server when submitting large numbers
> of jobs with dependencies defined
>
> e - added new parameter "log_keep_days" to both pbs_server and pbs_mom.
> Specifies how long to keep log files before they are automatically removed
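Both log_keep_days and accounting_keep_days amount to an age cutoff. The Python sketch below selects files whose modification time is older than the cutoff. It assumes mtime is the age criterion, and the file names are placeholders; it only lists candidates rather than deleting them, and it is not the pbs_server/pbs_mom implementation.

```python
import os
import tempfile
import time
from typing import List, Optional

SECONDS_PER_DAY = 86400


def expired_logs(log_dir: str, keep_days: int, now: Optional[float] = None) -> List[str]:
    """Return regular files in log_dir last modified more than keep_days ago."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * SECONDS_PER_DAY
    expired = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            expired.append(path)
    return expired


# Demo: one fresh log and one back-dated by 10 days.
log_dir = tempfile.mkdtemp()
for name in ("old.log", "new.log"):
    open(os.path.join(log_dir, name), "w").close()
ten_days_ago = time.time() - 10 * SECONDS_PER_DAY
os.utime(os.path.join(log_dir, "old.log"), (ten_days_ago, ten_days_ago))
print([os.path.basename(p) for p in expired_logs(log_dir, keep_days=7)])  # ['old.log']
```

Removing the returned paths with os.remove() would complete the cleanup. Keeping selection separate from deletion makes the policy easy to check before anything is removed.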
>
> e - added qmgr server attribute lock_file, specifies where server lock
> file is located
>
> b - change so we use default file name for output / error file when just
> a directory is specified on qsub / qalter -e -o options
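The fix can be sketched in Python as follows. It assumes the conventional PBS default name of the job name followed by ".o" (or ".e" for standard error) and the numeric sequence part of the job ID. resolve_output_path is a hypothetical helper, and the paths and job ID are placeholders.

```python
import os


def resolve_output_path(path: str, job_name: str, job_id: str, stream: str = "o") -> str:
    """If path names a directory, append '<job_name>.<stream><sequence>' (sketch)."""
    sequence = job_id.split(".", 1)[0]  # e.g. '1234' from '1234.headnode'
    if path.endswith("/") or os.path.isdir(path):
        return os.path.join(path, f"{job_name}.{stream}{sequence}")
    return path


print(resolve_output_path("/scratch/out/", "myjob", "1234.headnode"))       # /scratch/out/myjob.o1234
print(resolve_output_path("/scratch/out/", "myjob", "1234.headnode", "e"))  # /scratch/out/myjob.e1234
```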
>
> e - modified to allow retention of completed jobs across server shutdown
>
> e - added job_must_report qmgr configuration which says the job must be
> reported to scheduler. Added job attribute "reported". Added PURGECOMP
> functionality which allows scheduler to confirm jobs are reported. Also
> added -c option to qdel. Used to clean up unreported jobs.
>
> b - Fix so interactive jobs run when using $job_output_file_umask
> userdefault
>
> f - Allow adding extra End accounting record for a running job that is
> rerun. Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
>
> b - Fix to use queue/server resources_defaults to validate mppnodect
> against resources_max when mppwidth or mppnppn are not specified for job
>
> f - merged in new dynamic array struct and functions to implement a new
> (and more efficient) way of loading jobs at startup--should help by 2
> orders of magnitude!
>
> f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now
> changed by the MOM to be smaller than the pbs_server and is also
> configurable on the MOM ($max_conn_timeout_micro_sec)
>
> e - change so queued jobs that get deleted go to complete and get
> displayed in qstat based on keep_completed
>
> b - Changes to improve the qstat -x XML output and documentation
>
> b - Change so BATCH_PARTITION_ID does not pass through to child jobs
>
> c - fix to prevent segfault on pbs_server -t cold
>
> b - fix so find_resc_entry still works after setting server extra_resc
>
> c - keep pbs_server from trying to free an empty attrlist after receiving
> a bad request (Michael Meier, University of Erlangen-Nurnberg) (merged
> from 2.3.8)
>
> f - new fifo scheduler config option "ignore_queue: queue_name" allows
> the scheduler to be instructed to ignore up to 16 queues on the server
> (Simon Toth, muni.cz)
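A Python sketch of how "ignore_queue: queue_name" lines could filter the queues a scheduler considers. It assumes "#" starts a comment and that entries beyond the 16-queue limit are simply not used. ignored_queues and schedulable are hypothetical helpers, not the fifo scheduler's code.

```python
from typing import List

MAX_IGNORED_QUEUES = 16  # limit stated in the changelog entry


def ignored_queues(config_text: str) -> List[str]:
    """Collect 'ignore_queue: name' entries, keeping at most MAX_IGNORED_QUEUES."""
    queues: List[str] = []
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        key, sep, value = line.partition(":")
        if sep and key.strip() == "ignore_queue" and value.strip():
            if len(queues) < MAX_IGNORED_QUEUES:
                queues.append(value.strip())
    return queues


def schedulable(all_queues: List[str], config_text: str) -> List[str]:
    """Return the queues the scheduler would still look at, in their original order."""
    skip = set(ignored_queues(config_text))
    return [q for q in all_queues if q not in skip]


config = "ignore_queue: debug\n# a comment\nignore_queue: test\n"
print(schedulable(["batch", "debug", "test"], config))  # ['batch']
```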
>
> e - add administrator customizable email notifications (see manpage for
> pbs_server_attributes) - (Roland Haas, Georgia Tech)
>
> e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
>
> e - created a utility module that is shared between both server and mom
> but does NOT get placed in the libtorque library
>
> e - allow the user to request a specific processor geometry for their
> job using a bitmap, and then bind their jobs to those processors using
> cpusets.
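To show what a processor bitmap maps to, here is a Python sketch that converts a string of 0s and 1s into CPU indices and formats them as a comma-separated list, which Linux cpuset "cpus" files accept. The bit order (leftmost character is CPU 0) and the helper names are assumptions. This is not the actual TORQUE request syntax or its cpuset code.

```python
from typing import List


def bitmap_to_cpus(bitmap: str) -> List[int]:
    """Map a '0'/'1' string to CPU indices, assuming the leftmost character is CPU 0."""
    if not bitmap or set(bitmap) - {"0", "1"}:
        raise ValueError("bitmap must be a non-empty string of 0s and 1s")
    return [index for index, bit in enumerate(bitmap) if bit == "1"]


def cpuset_cpus_value(cpus: List[int]) -> str:
    """Format CPU indices as a comma-separated list, a form Linux cpuset 'cpus' files accept."""
    return ",".join(str(cpu) for cpu in cpus)


cpus = bitmap_to_cpus("1011")
print(cpus)                     # [0, 2, 3]
print(cpuset_cpus_value(cpus))  # 0,2,3
```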
>
> b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson,
> deCODE genetics) (merged from 2.3.8)
>
> b - fix to prevent some jobs from getting deleted on startup.
>
> f - add qpool.gz to contrib directory
>
> e - improve how error constants and text messages are represented (Simon
> Toth, muni.cz)
>
> f - new boolean queue attribute "is_transit" that allows jobs to exceed
> server resource limits (queue limits are respected). This allows routing
> queues to route jobs that would be rejected for exceeding local
> resources even when the job won't be run locally. (Simon Toth, muni.cz)
>
> e - add support for "job_array" as a type for queue disallowed_types
> attribute
>
> e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
>
> e - added pbs_mom config option igncput to ignore pcput limit enforcement
>
>
> 2.4.0
>
> f - added a "-q" option to pbs_mom which does *not* perform the default
> -p behavior
>
> e - made "pbs_mom -p" the default option when starting pbs_mom
>
> e - added -q to qalter to allow quicker response to modify requests
>
> f - added basic qhold support for job arrays
>
> b - clear out ji_destin in obit_reply
>
> f - add qchkpt command
>
> e - renamed job.h to pbs_job.h
>
> b - fix logic error in checkpoint interval test
>
> f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to
> change the default value of the job rerunnable attribute from true to false
>
> e - added preliminary Comprehensive System Accounting (CSA)
> functionality for Linux. Configure option --enable-csa will cause
> workload management records to be written if CSA is installed and wkmg
> is turned on.
>
> b - changes to allow post_checkpoint() to run when checkpoint is
> completed, not when it has just started. Also corrected issue when
> checkpoint fails while trying to put job on hold.
>
> b - update server immediately with changed checkpoint name and time
> attributes after successful checkpoint.
>
> e - Changes so checkpoint jobs that fail after being restarted are put on
> hold or requeued
>
> e - Added checkpoint_restart_status job attribute used for restart status
>
> b - Updated manpages for qsub and qterm to reflect changed checkpointing
> options.
>
> b - reject a qchkpt request if checkpointing is not enabled for the job
>
> b - Mom should not send checkpoint name and time to server unless
> checkpoint was successful
>
> b - fix so that running jobs that have a hold type and that fail on
> checkpoint restart get deleted when qdel is used
>
> b - fix so we reset start_time, if needed, when restarting a
> checkpointed job
>
> f - added experimental fault_tolerant job attribute (set to true by
> passing -f to qsub). This attribute indicates that a job can survive
> the loss of a sister mom. Also added corresponding fault_tolerant and
> fault_intolerant types to the "disallowed_types" queue attribute
>
> b - fixes for pbs_mom's updating of comment and checkpoint name and time
>
> e - change so we can reject hold requests on running jobs that do not
> have checkpoint enabled if system was configured with --enable-blcr
>
> e - change to qsub so only the host name can be specified on the -e/-o
> options
>
> e - added -w option to qsub that allows setting of PBS_O_WORKDIR
>
>
> Ken Nielson
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
Life is Good!