[torquedev] [torqueusers] TORQUE 2.4 is live.

Ken Nielson knielson at adaptivecomputing.com
Mon Nov 2 11:21:32 MST 2009


Garrick,

This is why we have a community: to keep those of us who are new on our 
toes. The numbering scheme will be the same.

Major.Minor.revision. I neglected to add a revision to this release.

Is there a problem with assuming this is 2.4.0, with future revisions 
starting at 2.4.1?
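
As a rough illustration of that assumption (a sketch only, not anything in 
the TORQUE tree; the helper name parse_version is made up for this example), 
a script comparing release names could treat a missing revision as 0:

```python
def parse_version(text):
    """Parse 'Major.Minor[.revision]' into a tuple of three ints.

    Assumption for this sketch: a missing revision means revision 0,
    so '2.4' and '2.4.0' name the same release.
    """
    parts = text.strip().split(".")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected Major.Minor[.revision], got {text!r}")
    numbers = [int(p) for p in parts]  # int() rejects empty or non-numeric fields
    if len(numbers) == 2:
        numbers.append(0)  # no revision given: assume .0
    return tuple(numbers)


if __name__ == "__main__":
    # The torque-2.4 tarball would sort as 2.4.0, just before 2.4.1.
    print(parse_version("2.4") == parse_version("2.4.0"))
    print(parse_version("2.4") < parse_version("2.4.1"))
```

Tuples compare field by field, so 2.3.8 sorts before 2.4 and 2.4 sorts 
before 2.4.1 without any extra logic.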

Ken Nielson
Adaptive Computing


Garrick Staples wrote:
> The tarball doesn't have a release minor; so what is the new numbering scheme?
>
> On Mon, Nov 02, 2009 at 10:03:02AM -0700, Ken Nielson alleged:
>   
>> TORQUE version 2.4 was officially released on Thursday, October 29, 
>> 2009. It will be available for download at 
>> http://www.clusterresources.com/downloads/torque/torque-2.4.tar.gz
>>
>> A new branch has been created in the subversion tree under 
>> torque/branches/2.4-fixes. This is where bug fixes are to be made for 
>> TORQUE 2.4.
>>
>> Also see http://www.clusterresources.com/products/torque/docs/ for 
>> updated TORQUE documentation.
>>
>> Some of the feature highlights are improved high availability, job 
>> arrays, improved job checkpointing, per-job epilogue and prologue 
>> scripts, and service jobs. Below is the CHANGELOG for this build.
>>
>> Please note: if you are running TORQUE 2.3.x and do not need any of the 
>> new features provided in 2.4, do not feel obligated to upgrade. 2.3.x 
>> will continue to be supported.
>>
>>
>> c - crash  b - bug fix   e - enhancement   f - new feature
>>
>> 2.4.2
>>
>> b - Added pbs_error_db.h to src/include/Makefile.am and 
>> src/include/Makefile.in.
>> pbs_error_db.h now needed for install.
>>
>> e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file 
>> will work with a comma delimited string or a list of server names 
>> separated by a new line.
>>
>> b - fix tracejob so it handles multiple server and mom logs for the same 
>> day
>>
>> f - Added a new server parameter np_default. This allows the 
>> administrator to change the number of processors to a unified value 
>> dynamically for the entire cluster.
>>
>> e - high availability enhanced so that the server spawns a separate 
>> thread to update the "lock" on the lockfile. Thread update and check 
>> time are both settable parameters in qmgr.
>>
>> b - close empty ACL files
>>
>> 2.4.1
>>
>> e - added a prologue and epilogue option to the list of resources for 
>> qsub -l which allows a per job prologue or epilogue script. The syntax 
>> for the new option is
>> qsub -l prologue=<prologue script>,epilogue=<epilogue script>
>>
>> f - added a "-w" option to qsub to override the working directory
>>
>> e - changes needed to allow relocatable checkpoint jobs. Job checkpoint 
>> files are now under the control of the server.
>>
>> c - check filename for NULL to prevent crash
>>
>> b - changed so we don't try to copy a local file when the destination is 
>> a directory and the file is already in that directory
>>
>> f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
>>
>> e - made logging functions reentrant safe by using localtime_r() instead 
>> of localtime() (merged from 2.3)
>>
>> e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
>>
>> e - merged in new log_ext() function to allow more fine grained syslog 
>> events, you can now specify severity level. Also added more logging 
>> statements
>>
>> b - fixed a bug where CPU time was not being added up properly in all 
>> cases. (fix for Linux only)
>>
>> c - fixed a few memory errors due to some uninitialized memory being 
>> allocated. (ported from 2.3 R2493)
>>
>> e - added code to allow compilers to override CLONE_BATCH_SIZE at 
>> configure time (allows for finer grained control on how arrays are 
>> created) (ported from Yahoo R2461)
>>
>> e - added code which prefixes the severity tag on all log_ext() and 
>> log_err() messages (ported from Yahoo R2358)
>>
>> f - added code from 2.3-extreme that allows TORQUE to handle more than 
>> 1024 sockets. Also, increased the size of TORQUE's internal socket 
>> handle table to avoid running out of handles under busy conditions.
>>
>> e - TORQUE can now handle server names larger than 64 bytes (now set to 
>> 1024, which should be larger than the max for hostnames)
>>
>> e - added qmgr option accounting_keep_days, specifies how long to keep 
>> accounting files.
>>
>> e - changed mom config varattr so invoked script returns the varattr 
>> name and value(s)
>>
>> e - improved the performance of pbs_server when submitting large numbers 
>> of jobs with dependencies defined
>>
>> e - added new parameter "log_keep_days" to both pbs_server and pbs_mom. 
>> Specifies how long to keep log files before they are automatically removed
>>
>> e - added qmgr server attribute lock_file, specifies where server lock 
>> file is located
>>
>> b - change so we use default file name for output / error file when just 
>> a directory is specified on qsub / qalter -e -o options
>>
>> e - modified to allow retention of completed jobs across server shutdown
>>
>> e - added job_must_report qmgr configuration which says the job must be 
>> reported to scheduler. Added job attribute "reported". Added PURGECOMP 
>> functionality which allows scheduler to confirm jobs are reported. Also 
>> added -c option to qdel, used to clean up unreported jobs.
>>
>> b - Fix so interactive jobs run when using $job_output_file_umask 
>> userdefault
>>
>> f - Allow adding extra End accounting record for a running job that is 
>> rerun. Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
>>
>> b - Fix to use queue/server resources_defaults to validate mppnodect 
>> against resources_max when mppwidth or mppnppn are not specified for job
>>
>> f - merged in new dynamic array struct and functions to implement a new 
>> (and more efficient) way of loading jobs at startup--should help by 2 
>> orders of magnitude!
>>
>> f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now 
>> changed by the MOM to be smaller than the pbs_server and is also 
>> configurable on the MOM ($max_conn_timeout_micro_sec)
>>
>> e - change so queued jobs that get deleted go to complete and get 
>> displayed in qstat based on keep_completed
>>
>> b - Changes to improve the qstat -x XML output and documentation
>>
>> b - Change so BATCH_PARTITION_ID does not pass through to child jobs
>>
>> c - fix to prevent segfault on pbs_server -t cold
>>
>> b - fix so find_resc_entry still works after setting server extra_resc
>>
>> c - keep pbs_server from trying to free empty attrlist after receiving 
>> bad request (Michael Meier, University of Erlangen-Nurnberg) (merged 
>> from 2.3.8)
>>
>> f - new fifo scheduler config option. ignore_queue: queue_name allows 
>> the scheduler to be instructed to ignore up to 16 queues on the server 
>> (Simon Toth, muni.cz)
>>
>> e - add administrator customizable email notifications (see manpage for 
>> pbs_server_attributes) - (Roland Haas, Georgia Tech)
>>
>> e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
>>
>> e - created a utility module that is shared between both server and mom 
>> but does NOT get placed in the libtorque library
>>
>> e - allow the user to request a specific processor geometry for their 
>> job using a bitmap, and then bind their jobs to those processors using 
>> cpusets.
>>
>> b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, 
>> deCODE genetics) (merged from 2.3.8)
>>
>> b - fix to prevent some jobs from getting deleted on startup.
>>
>> f - add qpool.gz to contrib directory
>>
>> e - improve how error constants and text messages are represented (Simon 
>> Toth, muni.cz)
>>
>> f - new boolean queue attribute "is_transit" that allows jobs to exceed 
>> server resource limits (queue limits are respected). This allows routing 
>> queues to route jobs that would be rejected for exceeding local 
>> resources even when the job won't be run locally. (Simon Toth, muni.cz)
>>
>> e - add support for "job_array" as a type for queue disallowed_types 
>> attribute
>>
>> e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
>>
>> e - added pbs_mom config option igncput to ignore pcput limit enforcement
>>
>>
>> 2.4.0
>>
>> f - added a "-q" option to pbs_mom which does *not* perform the default 
>> -p behavior
>>
>> e - made "pbs_mom -p" the default option when starting pbs_mom
>>
>> e - added -q to qalter to allow quicker response to modify requests
>>
>> f - added basic qhold support for job arrays
>>
>> b - clear out ji_destin in obit_reply
>>
>> f - add qchkpt command
>>
>> e - renamed job.h to pbs_job.h
>>
>> b - fix logic error in checkpoint interval test
>>
>> f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to 
>> change the default value of the job rerunnable attribute from true to false
>>
>> e - added preliminary Comprehensive System Accounting (CSA) 
>> functionality for Linux. Configure option --enable-csa will cause 
>> workload management records to be written if CSA is installed and wkmg 
>> is turned on.
>>
>> b - changes to allow post_checkpoint() to run when checkpoint is 
>> completed, not when it has just started. Also corrected issue when 
>> checkpoint fails while trying to put job on hold.
>>
>> b - update server immediately with changed checkpoint name and time 
>> attributes after successful checkpoint.
>>
>> e - Changes so checkpoint jobs failing after restarted are put on hold 
>> or requeued
>>
>> e - Added checkpoint_restart_status job attribute used for restart status
>>
>> b - Updated manpages for qsub and qterm to reflect changed checkpointing 
>> options.
>>
>> b - reject a qchkpt request if checkpointing is not enabled for the job
>>
>> b - Mom should not send checkpoint name and time to server unless 
>> checkpoint was successful
>>
>> b - fix so that running jobs that have a hold type and that fail on 
>> checkpoint restart get deleted when qdel is used
>>
>> b - fix so we reset start_time, if needed, when restarting a 
>> checkpointed job
>>
>> f - added experimental fault_tolerant job attribute (set to true by 
>> passing -f to qsub). This attribute indicates that a job can survive the 
>> loss of a sister mom. Also added corresponding fault_tolerant and 
>> fault_intolerant types to the "disallowed_types" queue attribute
>>
>> b - fixes for pbs_moms updating of comment and checkpoint name and time
>>
>> e - change so we can reject hold requests on running jobs that do not 
>> have checkpoint enabled if system was configured with --enable-blcr
>>
>> e - change to qsub so only the host name can be specified on the -e/-o 
>> options
>>
>> e - added -w option to qsub that allows setting of PBS_O_WORKDIR
>>
>>
>> Ken Nielson
>> Adaptive Computing
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>     
>
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
>   


