Ken Nielson knielson at adaptivecomputing.com
Thu Apr 22 12:17:35 MDT 2010

We have added a new branch to the TORQUE repository. The new branch is 
3.0-alpha and, as the name implies, this is alpha code.

Currently there are two main new features. The first is multi-mom, which 
allows more than one copy of pbs_mom to run on the same node and in the 
same cluster.

The second major feature is job_radix. When MOMs are allocated for a 
job, job_radix lets the user select a radix size from 2 to 9999. The 
Mother Superior then contacts only the number of sisters designated by 
job_radix to join the job. Each of those sisters in turn contacts up to 
the same number of sisters, and so on until all sisters have been 
included in the job. Subordinate sisters now communicate only with the 
sister that contacted them, including when sending stdout and stderr to 
pbs_demux. This feature allows TORQUE to scale well beyond its current 
capacity. It also looks like distributing communication across a 
hardware cluster may reduce network congestion at switches. Another 
possibility job_radix offers is replacing rpp with tcp, which would 
give a more standard and reliable method of communication.
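
To make the fan-out concrete, here is a minimal Python sketch of who 
contacts whom. It is not TORQUE code. It assumes the MOMs are numbered 
breadth-first with index 0 as the Mother Superior, which is only one 
possible layout; the ordering the 3.0-alpha branch actually uses may 
differ. The names children_of and parent_of are made up for this 
illustration.

```python
def children_of(index, radix, total):
    """Return the indices of the MOMs that MOM `index` would contact.

    Assumption: MOMs are numbered breadth-first and index 0 is the
    Mother Superior. This is an illustrative layout, not TORQUE's.
    """
    first = index * radix + 1
    return list(range(first, min(first + radix, total)))


def parent_of(index, radix):
    """Return the MOM that contacted `index` (None for the Mother Superior)."""
    if index == 0:
        return None
    return (index - 1) // radix


if __name__ == "__main__":
    # With a radix of 4, the Mother Superior contacts MOMs 1 through 4,
    # and MOM 1 then contacts MOMs 5 through 8.
    print(children_of(0, 4, 4500))
    print(children_of(1, 4, 4500))
```

In this layout each subordinate sister has exactly one parent, which 
matches the rule that a sister talks only to the sister that contacted 
it.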

By default, this version of TORQUE still wires up the sisters in a job 
the way TORQUE always has. To use job_radix, request it in the qsub 
command with the -W option.

example: qsub -l nodes=4500 -W job_radix=4 job.sh

The example above will allocate 4500 nodes to a job, with each MOM 
contacting up to 4 other MOMs in the job, starting with the Mother 
Superior. The Mother Superior is still the single head node for the job, 
and all sisters eventually report to it through the intermediate MOMs.
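
As a rough feel for what this means for the example, the sketch below 
counts how many levels of MOMs sit under the Mother Superior. It assumes 
the tree is filled level by level and that the node count includes the 
Mother Superior; both are assumptions for illustration, and tree_depth 
is a made-up helper, not part of TORQUE.

```python
def tree_depth(total_nodes, radix):
    """Count the levels below the Mother Superior in a radix tree.

    Assumption: every MOM contacts `radix` sisters until the nodes run
    out, so each level is filled before the next one starts.
    """
    depth = 0
    covered = 1       # the Mother Superior itself
    level_width = 1   # number of MOMs on the current level
    while covered < total_nodes:
        level_width *= radix
        covered += level_width
        depth += 1
    return depth


if __name__ == "__main__":
    # Levels hold 1, 4, 16, 64, 256, 1024, 4096 MOMs, so 4500 nodes fit
    # within 6 levels below the Mother Superior.
    print(tree_depth(4500, 4))  # 6 under the assumed layout
```

Under these assumptions the Mother Superior contacts 4 sisters directly 
instead of 4499, and the farthest sister is 6 hops away.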

Let me know if you have questions. The code does run. We were able to 
get it to work on a cluster of more than 3000 nodes, but I am sure there 
is still much to flesh out.

Ken Nielson
Adaptive Computing

