[torquedev] TORQUE 2.4 is live

Ken Nielson knielson at adaptivecomputing.com
Mon Nov 2 10:41:20 MST 2009


TORQUE version 2.4 was officially released on Thursday, October 29, 
2009. It will be available for download at 
http://www.clusterresources.com/downloads/torque/torque-2.4.tar.gz

A new branch has been created in the subversion tree under 
torque/branches/2.4-fixes. This is where bug fixes are to be made for 
TORQUE 2.4.

Also see http://www.clusterresources.com/products/torque/docs/ for 
updated TORQUE documentation.

Some of the feature highlights are improved high availability, job 
arrays, improved job checkpointing, per-job epilogue and prologue 
scripts, and service jobs. Below is the CHANGELOG for this build.

Please note: if you are running TORQUE 2.3.x and do not need any of the 
new features provided in 2.4, do not feel obligated to upgrade. 2.3.x 
will continue to be supported.


c - crash  b - bug fix   e - enhancement   f - new feature

2.4.2

b - Added pbs_error_db.h to src/include/Makefile.am and 
src/include/Makefile.in. pbs_error_db.h is now needed for install.

e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file 
will work with a comma-delimited string or a list of server names 
separated by newlines.

b - fix tracejob so it handles multiple server and mom logs for the same 
day

f - Added a new server parameter np_default. This allows the 
administrator to change the number of processors to a unified value 
dynamically for the entire cluster.

e - high availability enhanced so that the server spawns a separate 
thread to update the "lock" on the lockfile. Thread update and check 
time are both settable parameters in qmgr.

b - close empty ACL files
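
As an illustration of the server_name change above, here is a minimal 
Python sketch of reading a file that holds either a comma-delimited 
string or one server name per line. It is not the pbs_get_server_list 
code itself (TORQUE is written in C); the function name 
parse_server_list and the host names are hypothetical.

```python
import re


def parse_server_list(text):
    """Split server_name file contents into a list of server names.

    Accepts names separated by commas, newlines, or a mix of both.
    """
    # Split on runs of commas and line breaks, then drop surrounding
    # whitespace and any empty fields left by trailing separators.
    fields = re.split(r"[,\r\n]+", text)
    return [name.strip() for name in fields if name.strip()]


if __name__ == "__main__":
    # Hypothetical host names, one example per accepted layout.
    print(parse_server_list("head1.example.org,head2.example.org\n"))
    print(parse_server_list("head1.example.org\nhead2.example.org\n"))
```

Both calls should yield the same two-element list, which is the 
behaviour the changelog entry describes.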

2.4.1

e - added a prologue and epilogue option to the list of resources for 
qsub -l which allows a per-job prologue or epilogue script. The syntax 
for the new option is 
qsub -l prologue=<prologue script>,epilogue=<epilogue script>

f - added a "-w" option to qsub to override the working directory

e - changes needed to allow relocatable checkpoint jobs. Job checkpoint 
files are now under the control of the server.

c - check filename for NULL to prevent crash

b - changed so we don't try to copy a local file when the destination is 
a directory and the file is already in that directory

f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)

e - made logging functions reentrant-safe by using localtime_r() instead 
of localtime() (merged from 2.3)

e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch

e - merged in new log_ext() function to allow more fine-grained syslog 
events; you can now specify the severity level. Also added more logging 
statements

b - fixed a bug where CPU time was not being added up properly in all 
cases. (fix for Linux only)

c - fixed a few memory errors due to some uninitialized memory being 
allocated. (ported from 2.3 R2493)

e - added code to allow compilers to override CLONE_BATCH_SIZE at 
configure time (allows for finer grained control on how arrays are 
created) (ported from Yahoo R2461)

e - added code which prefixes the severity tag on all log_ext() and 
log_err() messages (ported from Yahoo R2358)

f - added code from 2.3-extreme that allows TORQUE to handle more than 
1024 sockets. Also, increased the size of TORQUE's internal socket 
handle table to avoid running out of handles under busy conditions.

e - TORQUE can now handle server names larger than 64 bytes (now set to 
1024, which should be larger than the max for hostnames)

e - added qmgr option accounting_keep_days, specifies how long to keep 
accounting files.

e - changed mom config varattr so invoked script returns the varattr 
name and value(s)

e - improved the performance of pbs_server when submitting large numbers 
of jobs with dependencies defined

e - added new parameter "log_keep_days" to both pbs_server and pbs_mom. 
Specifies how long to keep log files before they are automatically removed

e - added qmgr server attribute lock_file, specifies where server lock 
file is located

b - change so we use default file name for output / error file when just 
a directory is specified on qsub / qalter -e -o options

e - modified to allow retention of completed jobs across server shutdown

e - added job_must_report qmgr configuration which says the job must be 
reported to scheduler. Added job attribute "reported". Added PURGECOMP 
functionality which allows scheduler to confirm jobs are reported. Also 
added -c option to qdel. Used to clean up unreported jobs.

b - Fix so interactive jobs run when using $job_output_file_umask 
userdefault

f - Allow adding extra End accounting record for a running job that is 
rerun. Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.

b - Fix to use queue/server resources_defaults to validate mppnodect 
against resources_max when mppwidth or mppnppn are not specified for job

f - merged in new dynamic array struct and functions to implement a new 
(and more efficient) way of loading jobs at startup--should help by 2 
orders of magnitude!

f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now 
changed by the MOM to be smaller than the pbs_server and is also 
configurable on the MOM ($max_conn_timeout_micro_sec)

e - change so queued jobs that get deleted go to complete and get 
displayed in qstat based on keep_completed

b - Changes to improve the qstat -x XML output and documentation

b - Change so BATCH_PARTITION_ID does not pass through to child jobs

c - fix to prevent segfault on pbs_server -t cold

b - fix so find_resc_entry still works after setting server extra_resc

c - keep pbs_server from trying to free empty attrlist after receiving 
bad request (Michael Meier, University of Erlangen-Nurnberg) (merged 
from 2.3.8)

f - new fifo scheduler config option. ignore_queue: queue_name allows 
the scheduler to be instructed to ignore up to 16 queues on the server 
(Simon Toth, muni.cz)

e - add administrator customizable email notifications (see manpage for 
pbs_server_attributes) - (Roland Haas, Georgia Tech)

e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)

e - created a utility module that is shared between both server and mom 
but does NOT get placed in the libtorque library

e - allow the user to request a specific processor geometry for their 
job using a bitmap, and then bind their jobs to those processors using 
cpusets.

b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, 
deCODE genetics) (merged from 2.3.8)

b - fix to prevent some jobs from getting deleted on startup.

f - add qpool.gz to contrib directory

e - improve how error constants and text messages are represented (Simon 
Toth, muni.cz)

f - new boolean queue attribute "is_transit" that allows jobs to exceed 
server resource limits (queue limits are respected). This allows routing 
queues to route jobs that would be rejected for exceeding local 
resources even when the job won't be run locally. (Simon Toth, muni.cz)

e - add support for "job_array" as a type for queue disallowed_types 
attribute

e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement

e - added pbs_mom config option igncput to ignore pcput limit enforcement
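
To show how the new qsub options in this release fit together, the 
following Python sketch assembles a qsub argument list using the 
-l prologue=...,epilogue=... syntax and the -w working directory option 
listed above. The helper build_qsub_args and all file names are 
hypothetical; it only builds the argument list and does not run qsub.

```python
def build_qsub_args(job_script, prologue=None, epilogue=None, workdir=None):
    """Return an argv-style list for a qsub call with per-job scripts."""
    args = ["qsub"]

    # Per-job prologue/epilogue are passed as resources to -l, joined
    # with a comma as in the changelog's syntax line.
    resources = []
    if prologue:
        resources.append("prologue=" + prologue)
    if epilogue:
        resources.append("epilogue=" + epilogue)
    if resources:
        args += ["-l", ",".join(resources)]

    # -w overrides the job's working directory.
    if workdir:
        args += ["-w", workdir]

    args.append(job_script)
    return args


if __name__ == "__main__":
    # Hypothetical script names and directory.
    print(build_qsub_args("job.sh", prologue="pre.sh",
                          epilogue="post.sh", workdir="/scratch/run1"))
```

The resulting list could be handed to subprocess.run() on a host where 
qsub is installed; check the qsub manpage for your TORQUE version before 
relying on the exact option forms.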


2.4.0

f - added a "-q" option to pbs_mom which does *not* perform the default 
-p behavior

e - made "pbs_mom -p" the default option when starting pbs_mom

e - added -q to qalter to allow quicker response to modify requests

f - added basic qhold support for job arrays

b - clear out ji_destin in obit_reply

f - add qchkpt command

e - renamed job.h to pbs_job.h

b - fix logic error in checkpoint interval test

f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to 
change the default value of the job rerunnable attribute from true to false

e - added preliminary Comprehensive System Accounting (CSA) 
functionality for Linux. Configure option --enable-csa will cause 
workload management records to be written if CSA is installed and wkmg 
is turned on.

b - changes to allow post_checkpoint() to run when checkpoint is 
completed, not when it has just started. Also corrected issue when 
checkpoint fails while trying to put job on hold.

b - update server immediately with changed checkpoint name and time 
attributes after successful checkpoint.

e - Changes so checkpoint jobs failing after restarted are put on hold 
or requeued

e - Added checkpoint_restart_status job attribute used for restart status

b - Updated manpages for qsub and qterm to reflect changed checkpointing 
options.

b - reject a qchkpt request if checkpointing is not enabled for the job

b - Mom should not send checkpoint name and time to server unless 
checkpoint was successful

b - fix so that running jobs that have a hold type and that fail on 
checkpoint restart get deleted when qdel is used

b - fix so we reset start_time, if needed, when restarting a 
checkpointed job

f - added experimental fault_tolerant job attribute (set to true by 
passing -f to qsub). This attribute indicates that a job can survive the 
loss of a sister mom. Also added corresponding fault_tolerant and 
fault_intolerant types to the "disallowed_types" queue attribute

b - fixes for pbs_mom's updating of comment and checkpoint name and time

e - change so we can reject hold requests on running jobs that do not 
have checkpoint enabled if system was configured with --enable-blcr

e - change to qsub so only the host name can be specified on the -e/-o 
options

e - added -w option to qsub that allows setting of PBS_O_WORKDIR


Ken Nielson
Adaptive Computing


