[torqueusers] Re: launching GM jobs is too slow

Garrick Staples garrick at usc.edu
Fri Nov 11 03:35:44 MST 2005


On Thu, Nov 10, 2005 at 09:17:42PM -0700, Maestas, Christopher Daniel alleged:
> Garrick,
> 
> In fixing some scaling issues recently with Pete on ib, we found that
> changing the following code in the attached torque patch helped the
> pbs_mom with launch issues.  I would also suggest testing against the
> mpiexec in cvs as well.  Pete was going to release a new mpiexec rsn ... :-)

Reading mpiexec and mpirun.ch_gm.pl, I see basically 4 steps to launch an
mpichgm job:

1) Open some ports on the execution node.
   - this is fine with mpiexec and mpirun

2) Execute the binary on every node.
   - ssh1 does this really really fast
   - TM is definitely slower than it could be, but is still reasonable.

3) Read in mapping info from the nodes.
   - mpirun does this at blazing speeds
   - mpiexec sits for tens of seconds in read() calls

4) Send the combined mapping info back to the nodes.
   - mpirun and mpiexec both do this really fast.
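
To make steps 1, 3, and 4 concrete, here is a rough Python sketch of the
launcher side. It is not the real mpichgm wire protocol: the "rank port"
line format, the function names, and the port numbers are all made up for
illustration, and local threads stand in for the remote processes that
step 2 would start.

```python
import socket
import threading


def read_line(conn):
    """Read from a socket until a newline arrives or the peer closes."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
    return buf.decode().strip()


def collect_and_broadcast(listener, nprocs):
    """Step 3: read one 'rank port' line per process.
    Step 4: send the combined mapping back to every process."""
    conns = []
    mapping = {}
    for _ in range(nprocs):
        conn, _addr = listener.accept()
        # Blocking read: one slow peer stalls everything queued behind it.
        rank, port = read_line(conn).split()
        mapping[int(rank)] = port
        conns.append(conn)
    combined = " ".join(f"{r}:{mapping[r]}" for r in sorted(mapping))
    for conn in conns:
        conn.sendall((combined + "\n").encode())
        conn.close()
    return combined


def demo(nprocs=3):
    # Step 1: open a port on the launching node (port 0 = any free port).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(nprocs)
    port = listener.getsockname()[1]
    replies = [None] * nprocs

    def rank_main(rank):
        # Stand-in for a launched process (hypothetical message format).
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.sendall(f"{rank} {5000 + rank}\n".encode())
            replies[rank] = read_line(s)

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(nprocs)]
    for t in threads:
        t.start()
    combined = collect_and_broadcast(listener, nprocs)
    for t in threads:
        t.join()
    listener.close()
    return combined, replies


if __name__ == "__main__":
    print(demo())
```

In a loop like this, each connection is read to completion before the next
one is accepted, so the total time for step 3 is bounded by the slowest
peer. Whether that has anything to do with the delay in mpiexec is not
established here.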

Because steps 1, 2, and 4 are functioning correctly, things like
network/gm connectivity, NFS homedirs, pbs_mom, name resolution, etc.
must all be working fine.

I just can't come up with a reason why step 3 is a problem.  mpirun and
mpiexec are doing the exact same thing, but one is in Perl and the other
is in C.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California