[torquedev] TORQUE 2.4 is live.
Ken Nielson
knielson at adaptivecomputing.com
Mon Nov 2 10:03:02 MST 2009
TORQUE version 2.4 was officially released on Thursday, October 29,
2009. It will be available for download at
http://www.clusterresources.com/downloads/torque/torque-2.4.tar.gz
A new branch has been created in the subversion tree under
torque/branches/2.4-fixes. This is where bug fixes are to be made for
TORQUE 2.4.
Also see http://www.clusterresources.com/products/torque/docs/ for
updated TORQUE documentation.
Some of the feature highlights are improved high availability, job
arrays, improved job checkpointing, per-job epilogue and prologue
scripts, and service jobs. Below is the CHANGELOG for this build.
Please note: if you are running TORQUE 2.3.x and do not need any of the
new features provided in 2.4, do not feel obligated to upgrade. 2.3.x
will continue to be supported.
c - crash, b - bug fix, e - enhancement, f - new feature
2.4.2
b - Added pbs_error_db.h to src/include/Makefile.am and
src/include/Makefile.in.
pbs_error_db.h now needed for install.
e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file
will work with a comma-delimited string or a list of server names
separated by newlines.
b - fix tracejob so it handles multiple server and mom logs for the same
day
f - Added a new server parameter np_default. This allows the
administrator to change the number of processors to a unified value
dynamically for the entire cluster.
e - high availability enhanced so that the server spawns a separate
thread to update the "lock" on the lockfile. Thread update and check
time are both settable parameters in qmgr.
b - close empty ACL files
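To illustrate the two server_name file layouts mentioned in the
pbs_get_server_list entry above, here is a minimal Python sketch. It is
not TORQUE's implementation (which is in C); the function name
parse_server_list is hypothetical, and stripping whitespace, skipping
empty fields, and accepting a mix of both layouts are assumptions made
for the example.

```python
from typing import List


def parse_server_list(text: str) -> List[str]:
    """Split server_name file contents into individual server names.

    Accepts a comma-delimited string, one name per line, or a mix of
    both (accepting the mixed form is an assumption of this sketch).
    """
    names: List[str] = []
    for line in text.splitlines():
        for field in line.split(","):
            name = field.strip()
            if name:  # ignore empty fields and blank lines
                names.append(name)
    return names


if __name__ == "__main__":
    # Hypothetical host names, used only for illustration.
    print(parse_server_list("head1.example.org,head2.example.org\n"))
    print(parse_server_list("head1.example.org\nhead2.example.org\n"))
```

Both calls return the same two-element list, which is the behaviour the
entry describes: either layout names the same set of servers.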
2.4.1
e - added a prologue and epilogue option to the list of resources for
qsub -l which allows a per-job prologue or epilogue script. The syntax
for the new option is
qsub -l prologue=<prologue script>,epilogue=<epilogue script>
f - added a "-w" option to qsub to override the working directory
e - changes needed to allow relocatable checkpoint jobs. Job checkpoint
files are now under the control of the server.
c - check filename for NULL to prevent crash
b - changed so we don't try to copy a local file when the destination is
a directory and the file is already in that directory
f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
e - made logging functions reentrant-safe by using localtime_r() instead
of localtime() (merged from 2.3)
e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
e - merged in new log_ext() function to allow more fine-grained syslog
events; you can now specify the severity level. Also added more logging
statements
b - fixed a bug where CPU time was not being added up properly in all
cases. (fix for Linux only)
c - fixed a few memory errors due to some uninitialized memory being
allocated. (ported from 2.3 R2493)
e - added code to allow compilers to override CLONE_BATCH_SIZE at
configure time (allows for finer grained control on how arrays are
created) (ported from Yahoo R2461)
e - added code which prefixes the severity tag on all log_ext() and
log_err() messages (ported from Yahoo R2358)
f - added code from 2.3-extreme that allows TORQUE to handle more than
1024 sockets. Also, increased the size of TORQUE's internal socket
handle table to avoid running out of handles under busy conditions.
e - TORQUE can now handle server names larger than 64 bytes (now set to
1024, which should be larger than the max for hostnames)
e - added qmgr option accounting_keep_days, specifies how long to keep
accounting files.
e - changed mom config varattr so invoked script returns the varattr
name and value(s)
e - improved the performance of pbs_server when submitting large numbers
of jobs with dependencies defined
e - added new parameter "log_keep_days" to both pbs_server and pbs_mom.
Specifies how long to keep log files before they are automatically removed
e - added qmgr server attribute lock_file, specifies where server lock
file is located
b - change so we use default file name for output / error file when just
a directory is specified on qsub / qalter -e -o options
e - modified to allow retention of completed jobs across server shutdown
e - added job_must_report qmgr configuration which says the job must be
reported to scheduler. Added job attribute "reported". Added PURGECOMP
functionality which allows scheduler to confirm jobs are reported. Also
added a -c option to qdel, used to clean up unreported jobs.
b - Fix so interactive jobs run when using $job_output_file_umask
userdefault
f - Allow adding extra End accounting record for a running job that is
rerun. Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
b - Fix to use queue/server resources_defaults to validate mppnodect
against resources_max when mppwidth or mppnppn are not specified for job
f - merged in new dynamic array struct and functions to implement a new
(and more efficient) way of loading jobs at startup--should help by 2
orders of magnitude!
f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now
changed by the MOM to be smaller than the pbs_server and is also
configurable on the MOM ($max_conn_timeout_micro_sec)
e - change so queued jobs that get deleted go to complete and get
displayed in qstat based on keep_completed
b - Changes to improve the qstat -x XML output and documentation
b - Change so BATCH_PARTITION_ID does not pass through to child jobs
c - fix to prevent segfault on pbs_server -t cold
b - fix so find_resc_entry still works after setting server extra_resc
c - keep pbs_server from trying to free empty attrlist after receiving
bad request (Michael Meier, University of Erlangen-Nurnberg) (merged
from 2.3.8)
f - new fifo scheduler config option. ignore_queue: queue_name allows
the scheduler to be instructed to ignore up to 16 queues on the server
(Simon Toth, muni.cz)
e - add administrator customizable email notifications (see manpage for
pbs_server_attributes) - (Roland Haas, Georgia Tech)
e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
e - created a utility module that is shared between both server and mom
but does NOT get placed in the libtorque library
e - allow the user to request a specific processor geometry for their
job using a bitmap, and then bind their jobs to those processors using
cpusets.
b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson,
deCODE genetics) (merged from 2.3.8)
b - fix to prevent some jobs from getting deleted on startup.
f - add qpool.gz to contrib directory
e - improve how error constants and text messages are represented (Simon
Toth, muni.cz)
f - new boolean queue attribute "is_transit" that allows jobs to exceed
server resource limits (queue limits are respected). This allows routing
queues to route jobs that would be rejected for exceeding local
resources even when the job won't be run locally. (Simon Toth, muni.cz)
e - add support for "job_array" as a type for queue disallowed_types
attribute
e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
e - added pbs_mom config option igncput to ignore pcput limit enforcement
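As a rough illustration of the per-job prologue/epilogue syntax and the
"-w" option listed in the 2.4.1 entries above, the following Python
sketch only assembles a qsub argument list; it does not run qsub. The
function name build_qsub_args and all script and directory names are
hypothetical.

```python
from typing import List, Optional


def build_qsub_args(job_script: str,
                    prologue: Optional[str] = None,
                    epilogue: Optional[str] = None,
                    workdir: Optional[str] = None) -> List[str]:
    """Return an argv list for qsub; nothing is executed here."""
    args = ["qsub"]
    # Per-job scripts are passed as resources in a single -l list:
    #   -l prologue=<prologue script>,epilogue=<epilogue script>
    resources = []
    if prologue:
        resources.append(f"prologue={prologue}")
    if epilogue:
        resources.append(f"epilogue={epilogue}")
    if resources:
        args += ["-l", ",".join(resources)]
    # -w overrides the job's working directory.
    if workdir:
        args += ["-w", workdir]
    args.append(job_script)
    return args


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_qsub_args("job.sh", "pro.sh", "epi.sh", "/tmp/work")))
```

The resulting list could be handed to subprocess.run() on a host where
qsub is installed; building it as a list rather than a shell string
avoids quoting problems with script paths.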
2.4.0
f - added a "-q" option to pbs_mom which does *not* perform the default
-p behavior
e - made "pbs_mom -p" the default option when starting pbs_mom
e - added -q to qalter to allow quicker response to modify requests
f - added basic qhold support for job arrays
b - clear out ji_destin in obit_reply
f - add qchkpt command
e - renamed job.h to pbs_job.h
b - fix logic error in checkpoint interval test
f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. Allows the admin to
change the default value of the job rerunnable attribute from true to false
e - added preliminary Comprehensive System Accounting (CSA)
functionality for Linux. Configure option --enable-csa will cause
workload management records to be written if CSA is installed and wkmg
is turned on.
b - changes to allow post_checkpoint() to run when checkpoint is
completed, not when it has just started. Also corrected issue when
checkpoint fails while trying to put job on hold.
b - update server immediately with changed checkpoint name and time
attributes after successful checkpoint.
e - Changes so checkpoint jobs that fail after being restarted are put
on hold or requeued
e - Added checkpoint_restart_status job attribute used for restart status
b - Updated manpages for qsub and qterm to reflect changed checkpointing
options.
b - reject a qchkpt request if checkpointing is not enabled for the job
b - Mom should not send checkpoint name and time to server unless
checkpoint was successful
b - fix so that running jobs that have a hold type and that fail on
checkpoint restart get deleted when qdel is used
b - fix so we reset start_time, if needed, when restarting a
checkpointed job
f - added experimental fault_tolerant job attribute (set to true by passing
-f to qsub). This attribute indicates that a job can survive the loss of
a sister mom. Also added corresponding fault_tolerant and
fault_intolerant types to the "disallowed_types" queue attribute
b - fixes for pbs_mom's updating of comment and checkpoint name and time
e - change so we can reject hold requests on running jobs that do not have
checkpoint enabled if system was configured with --enable-blcr
e - change to qsub so only the host name can be specified on the -e/-o
options
e - added -w option to qsub that allows setting of PBS_O_WORKDIR
Ken Nielson
Adaptive Computing