[torqueusers] TORQUE 4.0 Officially Announced
DuChene, StevenX A
stevenx.a.duchene at intel.com
Fri Mar 16 20:26:53 MDT 2012
It is unclear from this announcement text where hwloc has to be installed.
Is it just on the server or on the nodes only?
I looked in the various README files and the Release_Notes file packaged with the sources, and there is no mention of hwloc in any of them. There is only one short mention in the CHANGELOG file, which says even less than the announcement below.
More documentation about this would be greatly appreciated.
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Tuesday, March 13, 2012 12:43 PM
To: Torque Users Mailing List; Torque Developers mailing list
Subject: [torqueusers] TORQUE 4.0 Officially Announced
TORQUE 4.0 is officially here! Please check out Adaptive Computing's official announcement here: http://www.adaptivecomputing.com/adaptive-computing-offers-the-next-generation-of-high-performance-computing-with-moab-hpc-suite-7-0/
The tarball can be downloaded from here: http://www.adaptivecomputing.com/resources/downloads/torque/torque-4.0.0.tar.gz
We have several sites currently using 4.0 and feedback has been positive. These warnings are posted on the download site, but I am copying them here:
1. Make sure that you have openssl-devel (RedHat based) / libssl-dev (Debian based) installed (the name may differ for different operating systems) in order to be able to build TORQUE 4.0.
2. Make sure that you run the daemon trqauthd on machines that will be running client commands (this includes Moab). NOTE: there is an init.d script for it in contrib/init.d/ but it needs customization. One problem is that it sets PBS_DAEMON to the wrong path: it should be /usr/local/sbin/trqauthd by default, not /usr/local/bin/trqauthd.
3. Moab needs to be started or restarted after installing TORQUE 4.0 (if you are using Moab)
Please make sure to take all normal precautions for upgrading. Another advisory (not on the website) is that TORQUE now uses hwloc to manage cpusets, meaning you will need to install hwloc on your system if it isn't already there and you wish to use it. It needs to be version 1.1 or higher.
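As a rough illustration of the version requirement above, here is a minimal Python sketch that checks whether a reported hwloc version is at least 1.1. The function names are hypothetical, and it assumes you already have the version string in hand (for example from your package manager, or from `pkg-config --modversion hwloc` if your hwloc install provides a pkg-config file). It is not part of TORQUE.

```python
import re

# Minimum hwloc version stated in the announcement.
MIN_HWLOC = (1, 1)


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def hwloc_is_new_enough(version_text, minimum=MIN_HWLOC):
    """True if version_text names a version >= minimum (hypothetical helper)."""
    # Tuples compare component by component, so (1, 10) > (1, 1),
    # which a plain string comparison would get wrong.
    return parse_version(version_text) >= minimum
```

For example, `hwloc_is_new_enough("1.0.3")` is false, while `hwloc_is_new_enough("1.4.1")` is true.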
The major features of the release are briefly described in the release announcement, and the CHANGELOG for 4.0 is copied at the end of this email.
This release has undergone more testing than any previous release of TORQUE; to be fair, it also has more changes than any previous version of TORQUE. Overall, we saw very good results in our beta program and most of the sites using it have had good experiences. We are proud of the quality of this release and hope that you'll try it out and let us know how it works for you.
David Beer | Software Engineer
e - make a threadpool for TORQUE server. The number of threads is
customizable using min_threads and max_threads, and idle time before
exiting can be set using thread_idle_seconds.
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
e - remove the forking from pbs_server running a job, the thread handling the request just
waits until the job is run.
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
of every individual job
e - no longer fork to send mail, just use a thread
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
cores and uses only physical cores for the job. It is true by default.
(contributed by Dr. Bernd Kallies)
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
ask for user input from the command line. The call to ask if the user wants to continue
was moved higher in the initialization process and some of the wording changed to
reflect what is now happening.
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
start instead of failing silently.
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
to nodes. We only loop over each node once at a maximum.
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
perform pbs_iff's functionality, increasing speed and enabling future security
e - add mom_hierarchy functionality for reporting. The file is located in
<TORQUE_HOME>/server_priv/mom_hierarchy, and can be written to tell moms to send
updates to other moms who will pass them on to pbs_server. See docs for details
e - add a unit testing framework (check). It is compiled with --with-check and tests
are executed using make check. The framework is complete but not many tests have
been written as of yet.
e - Mom rejection messages are now passed back to qrun when possible
e - Added the option -c for startup. By default, the server attempts to send the mom
hierarchy file to all moms on startup, and all moms update the server and request
the hierarchy file. If both are trying to do this at once, it can cause a lot of
traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
haven't contacted it, reducing this traffic.
e - Added mom parameter -w to reduce start times. With this parameter, the mom waits to send its
first update until the server sends it the mom hierarchy file, or until 10
minutes have passed. This should reduce large cluster startup times.
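
The "wait for the hierarchy file, or give up after 10 minutes" behaviour described for the mom -w parameter can be pictured with the following Python sketch. It only illustrates the general wait-with-timeout pattern; the function and variable names are hypothetical, and it does not reflect how pbs_mom is actually implemented.

```python
import threading

# The 10-minute limit mentioned in the CHANGELOG entry.
HIERARCHY_TIMEOUT_SECONDS = 10 * 60


def wait_before_first_update(hierarchy_received, timeout=HIERARCHY_TIMEOUT_SECONDS):
    """Block until the hierarchy arrives or the timeout expires.

    hierarchy_received is a threading.Event that some other thread sets
    when the hierarchy file has been received (an assumption of this sketch).
    Returns "hierarchy" or "timeout" to say which condition ended the wait.
    """
    # Event.wait returns True if the event was set, False on timeout.
    if hierarchy_received.wait(timeout):
        return "hierarchy"
    return "timeout"
```

In either case the caller would then go on to send its first update; the return value only records why the wait ended.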