[torqueusers] RE: launching GM jobs is too slow

Maestas, Christopher Daniel cdmaest at sandia.gov
Thu Nov 10 21:17:42 MST 2005


In fixing some scaling issues recently with Pete on ib, we found that
changing the code as in the attached torque patch fixed the pbs_mom
launch issues.  I would also suggest testing against the mpiexec in
cvs as well.  Pete was going to release a new mpiexec real soon now ... :-)

- Chris

-----Original Message-----
From: mpiexec-bounces at osc.edu [mailto:mpiexec-bounces at osc.edu] On Behalf
Of Garrick Staples
Sent: Thursday, November 10, 2005 8:02 PM
To: mpiexec at osc.edu
Subject: launching GM jobs is too slow

I've been investigating why I'm having a hard time launching
mpichgm-1.2.6..14a jobs with more than 500 CPUs.  It seems that
actually booting MPI is what takes too long.

With the mpichgm mpirun perl script using ssh1, I can launch a 1000 CPU
job in a few seconds.  A simple MPI helloworld will launch and complete
in 9 seconds.

Using strace and some strategic printf's tells me that mpiexec spends a
lot of time waiting on read().  The accept() calls return just fine, but
the read() calls can take several seconds:

     0.000000 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
     2.609772 read(6, "<", 1)           = 1
     0.000000 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
     7.619335 read(6, "<", 1)           = 1
     0.000000 nanosleep({0, 200000000}, NULL) = 0
     0.209981 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
    19.578293 read(6, "<", 1)           = 1

The thing that really baffles me is that mpiexec appears to do almost
the same thing as the mpichgm perl script, but for some reason the perl
script never gets blocked.

I've been futzing with the mpiexec code, trying random things like
using a blocking socket, using recv() instead of read(), etc.  But
nothing seems to cure those slow read()s from the network.

Some actual timings with the latest TORQUE-2.0.0p2, RHEL3 x86_64,
mpichgm-1.2.6..14a, and mpiexec-0.80 on 480 CPUs:

A few hundred trivial TM tasks are fine:
$ time mpiexec -comm none /bin/true
real    0m20.888s

$ time pbsdsh /bin/true
real    0m21.168s

A few hundred TM tasks that live for a while are fine:
$ time mpiexec -comm none bash -c "\"sleep 60;true\""
real    1m21.453s

$ time pbsdsh bash -c "sleep 60;true"
real    1m21.363s

But an actual GM MPI job is a problem:
$ time mpirun -allcpus ./helloworld
real    0m7.319s

$ time mpiexec ./mpitest/helloworld
real    1m4.664s
... past 500 CPUs I tend to just get "<<<...>>> string not recognized"
(which is <<<ABORT_magic_ABORT>>> from a node that timed out)

$ time mpiexec -v ./mpitest/helloworld
mpiexec: resolve_exe: using absolute exe "./mpitest/helloworld".
mpiexec: All 480 tasks started.
read_gm_startup_ports: waiting for info
read_gm_startup_ports: mpich gm version 12510
read_gm_startup_ports: id 1 port 2 board 96 gm_node_id 0xdd4832d8
  numanode 0 pid 30181 remote_port 14249 ... this part takes over a
minute, just reading in the node info

Garrick Staples, Linux/HPCC Administrator University of Southern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-job_recov.c
Type: application/octet-stream
Size: 385 bytes
Desc: patch-job_recov.c
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20051110/c6846b8a/patch-job_recov-0001.obj
