[torqueusers] MPI broadcast test fails only when I run within a torque job
Rahul Nabar
rpnabar at gmail.com
Wed Jul 28 17:35:09 MDT 2010
I'm not sure whether this is a Torque issue or an MPI issue. If I log in to
a compute node and run the standard MPI broadcast test, it returns no
error, but if I run it through PBS/Torque I get an error (see below).
The nodes that return the error are fairly random. Even the same set
of nodes will pass the test once and then fail the next time. In
case it matters, these nodes have dual interfaces: 1GigE and 10GigE.
I ran all tests on the same group of 32 nodes.
If I log in to the node (just as a regular user, not as root) then the
test runs fine. No errors at all.
Is there a timeout somewhere? Or some such issue? Not at all sure why
this is happening....
Things I've verified: ulimit seems OK. I have explicitly set the
ulimit in the pbs init script as well as in the sshd init script.
[root@eu013 ~]# grep ulimit /etc/init.d/pbs
ulimit -l unlimited
[root@eu013 ~]# grep ulimit /etc/init.d/sshd
ulimit -l unlimited
ssh eu013 ulimit -l
unlimited
Even a "ulimit -l" placed inside a PBS job returns unlimited.
"cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs" returns
zero on all nodes concerned. ifconfig does not report any error
packets either.
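
The per-node checks above (ulimit -l, the tcpRetransSegs counter) can be
repeated over every node of a job in one pass. A minimal sketch, assuming
passwordless ssh between the nodes and that it runs inside a job where
Torque has set $PBS_NODEFILE; the function names and the RUNNER variable
are made up for illustration:

```shell
#!/bin/bash
# Print each distinct host from a Torque node file, one per line.
# The node file has one line per allocated slot, so hosts may repeat.
unique_hosts() {
    sort -u "$1"
}

# Run the same command on every distinct host and prefix the output
# with the host name. RUNNER is a hypothetical override (default: ssh)
# so the loop can be dry-run without touching the network.
check_nodes() {
    local nodefile=$1; shift
    local host
    for host in $(unique_hosts "$nodefile"); do
        printf '%s: ' "$host"
        ${RUNNER:-ssh} "$host" "$@"
    done
}
```

Inside a job this could be called as check_nodes "$PBS_NODEFILE" ulimit -l,
or with cat /sys/class/infiniband/cxgb3_0/proto_stats/tcpRetransSegs as the
command. Note that a limit read this way is the one inherited from sshd on
the remote node, not necessarily the one that processes started by pbs_mom
see.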
--
Rahul
#############################################################
PBS command:
mpirun -mca btl openib,sm,self -mca orte_base_help_aggregate 0 \
    /opt/src/mpitests/imb/src/IMB-MPI1 bcast
----------------------------- through PBS -----------------------------
The RDMA CM returned an event error while attempting to make a
connection. This type of error usually indicates a network
configuration error.
Local host: eu013
Local device: cxgb3_0
Error name: RDMA_CM_EVENT_UNREACHABLE
Peer: eu010
Your MPI job will now abort, sorry.
-------------------------------------------------------------------------
#######################################
Run directly from a compute node:
mpirun -host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 \
    -mca btl openib,sm,self -mca orte_base_help_aggregate 0 \
    /opt/src/mpitests/imb/src/IMB-MPI1 bcast
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 42
#----------------------------------------------------------------
    #bytes  #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]
         0          1000          0.02          0.03          0.02
         1          1000        170.70        170.76        170.74
         2          1000        171.04        171.10        171.08
         4          1000        171.09        171.15        171.13
         8          1000        171.05        171.13        171.10
        16          1000        171.03        171.10        171.07
        32          1000         31.93         32.00         31.98
        64          1000         28.86         29.02         28.99
       128          1000         29.34         29.40         29.38
       256          1000         29.90         29.98         29.95
       512          1000         30.39         30.47         30.44
      1024          1000         31.59         31.67         31.64
      2048          1000         38.15         38.26         38.23
      4096          1000        187.59        187.75        187.68
      8192          1000        208.26        208.41        208.37
     16384          1000        395.47        395.71        395.61
     32768          1000       9360.99       9441.36       9416.47
     65536           400      10522.09      11003.08      10781.73
    131072           299      16971.71      17647.29      17329.27
    262144           160      15404.01      17131.36      15816.46
    524288            80       2659.56       4258.90       3002.04
   1048576            40       4305.72       5305.33       5219.00
   2097152            20       2472.34      10711.80       8599.28
   4194304            10       6275.51      20791.20      13687.10
# All processes entering MPI_Finalize
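
The benchmark output above reports 42 processes, although the tests were
meant to cover 32 nodes, so the -host list of the direct run is worth a
sanity check. A small sketch that counts the entries and lists any host
named more than once; the helper names are invented for illustration, and
the list is copied from the mpirun command above, split into three pieces
only for readability:

```shell
#!/bin/bash
# Count the entries in a comma-separated host list.
count_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Count the distinct hosts in the list.
count_distinct_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | sort -u | grep -c .
}

# Print every host that appears more than once.
duplicate_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | sort | uniq -d
}

hosts="eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019"
hosts="$hosts,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019"
hosts="$hosts,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032"

echo "entries:  $(count_hosts "$hosts")"           # 42
echo "distinct: $(count_distinct_hosts "$hosts")"  # 32
echo "repeated:"
duplicate_hosts "$hosts"                           # eu010 through eu019
```

The 42 entries match the "#processes = 42" line: eu010 through eu019 are
each listed twice. The direct run may therefore not have used the same
process layout as the PBS job, which is worth keeping in mind when
comparing the two.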