[torquedev] 3.0-alpha branch added to TORQUE subversion tree
Ken Nielson
knielson at adaptivecomputing.com
Thu Apr 22 12:17:35 MDT 2010
We have added a new branch to the TORQUE repository. The new branch is
3.0-alpha and, as the name implies, this is alpha code.
Currently there are two main new features. The first is multi-mom, which
allows more than one copy of pbs_mom to run on the same node in the same
cluster.
The second is job_radix. When MOMs are allocated for a job, job_radix
lets the user select a radix size from 2 to 9999. The Mother Superior
then contacts only the number of sisters designated by job_radix to
join the job. Each of those sisters in turn contacts up to the same
number of sisters, and so on, until all sisters have been included in
the job. Subordinate sisters now communicate only with the sister that
contacted them, including when sending stdout and stderr to pbs_demux.
This feature allows TORQUE to scale well beyond its current capacity.
It also looks like distributing communication across a hardware cluster
may reduce network congestion at switches. Another possibility
job_radix offers is replacing rpp with tcp for communication, which
would give a more standard and reliable method of communication.
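To illustrate the fan-out described above, here is a minimal Python sketch of which sisters each MOM would contact. It is not TORQUE code. The function name build_radix_tree and the breadth-first numbering (index 0 is the Mother Superior, and MOM i contacts MOMs i*radix+1 through i*radix+radix) are assumptions made for illustration; the actual order in which TORQUE assigns sisters may differ.

```python
from typing import Dict, List


def build_radix_tree(num_moms: int, radix: int) -> Dict[int, List[int]]:
    """Map each MOM index to the sisters it would contact.

    Index 0 stands for the Mother Superior. The breadth-first layout
    is an assumption for illustration, not TORQUE's actual ordering.
    """
    if num_moms < 1:
        raise ValueError("a job needs at least one MOM")
    if not 2 <= radix <= 9999:
        raise ValueError("job_radix must be between 2 and 9999")

    tree: Dict[int, List[int]] = {}
    for parent in range(num_moms):
        first = parent * radix + 1                # first sister this MOM contacts
        last = min(first + radix, num_moms)       # stop when the MOMs run out
        tree[parent] = list(range(first, last))   # empty list for leaf MOMs
    return tree


if __name__ == "__main__":
    # Ten MOMs with a radix of 3: the Mother Superior contacts three sisters,
    # and those sisters contact the remaining six between them.
    for mom, sisters in build_radix_tree(10, 3).items():
        print(mom, sisters)
```

In this layout every sister has exactly one contacting MOM, which matches the rule that a subordinate sister talks only to the sister that contacted it.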
By default, this version of TORQUE still wires up the sisters in a job
the way TORQUE always has. To use job_radix, request it in the qsub
command with the -W option.
example: qsub -l nodes=4500 -W job_radix=4 job.sh
The example above will allocate 4500 nodes to a job, with each MOM
contacting up to 4 other MOMs in the job, starting with the Mother
Superior. The Mother Superior is still the single head node for the
job, with all sisters eventually reporting to the Mother Superior
through each intermediate MOM.
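As a rough worked example of what this means for the request above, the sketch below counts how many levels of intermediate contact are needed to reach every MOM. It assumes each level of the tree is filled completely before the next one starts and that the 4500 nodes include the Mother Superior. The function name radix_tree_depth is hypothetical and this is not TORQUE code.

```python
def radix_tree_depth(num_moms: int, radix: int) -> int:
    """Return the number of hops from the Mother Superior to the farthest sister.

    Assumes every level is filled completely before the next begins.
    """
    if num_moms < 1:
        raise ValueError("a job needs at least one MOM")
    if radix < 2:
        raise ValueError("job_radix must be at least 2")

    depth = 0
    covered = 1       # the Mother Superior alone
    level_size = 1    # number of MOMs on the current level
    while covered < num_moms:
        level_size *= radix     # each MOM on this level contacts `radix` sisters
        covered += level_size
        depth += 1
    return depth


if __name__ == "__main__":
    # 1 + 4 + 16 + 64 + 256 + 1024 = 1365 MOMs fit in five levels, and
    # adding a sixth level of 4096 brings the total to 5461, enough for 4500.
    print(radix_tree_depth(4500, 4))  # prints 6
```

Under these assumptions the Mother Superior contacts 4 sisters directly instead of 4499, and the farthest sister is 6 hops away.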
Let me know if you have questions. The code does run. We were able to
get it to work on a cluster of more than 3000 nodes, but I am sure
there is much to flesh out.
Ken Nielson
Adaptive Computing