[torqueusers] Re: launching GM jobs is too slow
Garrick Staples
garrick at usc.edu
Fri Nov 11 03:35:44 MST 2005
On Thu, Nov 10, 2005 at 09:17:42PM -0700, Maestas, Christopher Daniel alleged:
> Garrick,
>
> In fixing some scaling issues recently with Pete on ib, we found that
> changing the following code in the attached torque patch helped the
> pbs_mom with launch issues. I would also suggest testing against the
> mpiexec in cvs as well. Pete was going to release a new mpiexec rsn ... :-)
Reading mpiexec and mpirun.ch_gm.pl, I see basically 4 steps to launch an
mpichgm job:
1) Open some ports on the execution node.
   - this is fine with mpiexec and mpirun
2) Execute the binary on every node.
   - ssh1 does this really really fast
   - TM is definitely slower than it could be, but is still reasonable.
3) Read in mapping info from the nodes.
   - mpirun does this at blazing speeds
   - mpiexec sits for tens of seconds in read() calls
4) Send the combined mapping info back to the nodes.
   - mpirun and mpiexec both do this really fast.
Because steps 1, 2, and 4 are functioning correctly, things like
network/gm connectivity, NFS homedirs, pbs_mom, name resolution, etc.
must all be working fine.
I just can't come up with a reason why step 3 is a problem. mpirun and
mpiexec are doing the exact same thing, but one is in Perl and the other
is in C.
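For reference, step 3 can be sketched roughly as below. This is a minimal
Python sketch, not the code from mpiexec or mpirun.ch_gm.pl: the function
name collect_mapping, the one-line "rank port_id node_id" record format, and
the timeout value are all assumptions made up for illustration. It
multiplexes every connection through a selector instead of blocking in
read() on one socket at a time, and it records how long each rank took to
deliver its line, which is the number that would show where the tens of
seconds are going.

```python
import selectors
import socket
import time


def collect_mapping(listener, nprocs, timeout=30.0):
    """Accept connections and read one mapping line from each rank.

    Hypothetical record format (an assumption, not the real GM protocol):
        b"<rank> <port_id> <node_id>\n"

    Returns (mapping, waits):
        mapping: {rank: (port_id, node_id)}
        waits:   {rank: seconds from accept() to a complete line}
    """
    sel = selectors.DefaultSelector()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, None)
    mapping, waits = {}, {}
    deadline = time.monotonic() + timeout

    while len(mapping) < nprocs:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"only {len(mapping)} of {nprocs} ranks reported")
        for key, _events in sel.select(remaining):
            if key.data is None:
                # The listening socket is readable: a new rank is connecting.
                conn, _addr = listener.accept()
                conn.setblocking(False)
                state = {"buf": b"", "t0": time.monotonic()}
                sel.register(conn, selectors.EVENT_READ, state)
                continue

            conn, state = key.fileobj, key.data
            chunk = conn.recv(4096)
            state["buf"] += chunk
            if b"\n" not in state["buf"] and chunk:
                continue  # partial line so far; wait for more data

            # Either a full line arrived or the peer closed the connection.
            sel.unregister(conn)
            conn.close()
            if b"\n" not in state["buf"]:
                continue  # peer closed without sending a full record

            line = state["buf"].split(b"\n", 1)[0]
            rank, port_id, node_id = (int(x) for x in line.split())
            mapping[rank] = (port_id, node_id)
            waits[rank] = time.monotonic() - state["t0"]

    sel.unregister(listener)
    sel.close()
    return mapping, waits
```

Sorting waits by value would show whether a few slow ranks hold up the
whole launch or whether every read is slow, which points at the launcher
rather than the nodes.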
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California