[torqueusers] TM interface errors

Garrick Staples garrick at usc.edu
Thu Apr 13 13:21:12 MDT 2006


On Thu, Apr 13, 2006 at 09:49:07AM -0400, Prakash Velayutham alleged:
> Nope. No replies yet. People here seem to hold the idea that TM module 
> in Open MPI captures the single TM connection available. But there are 
> no replies on why the same works from Mother Superior, but does not work 
> from MOMs alone (with or without the TM_MULTIPLE_CONNS patch).

How is your MPI process being launched: through rsh/ssh or through TM?
A process started on a sister via rsh/ssh can't use TM.


I just did some simple tests with 2.1.0.  2.0.0 with the multi_conn
patch should behave the same way.

With nodes=2:ppn=1, 'pbsdsh pbsdsh hostname' works fine.

With nodes=1:ppn=2, it still works.  Even gross abuses like 'pbsdsh
pbsdsh pbsdsh pbsdsh hostname' work fine.

However, this breaks down with nodes=2:ppn=2.  It looks like TM
connections on sisters don't work if the sister has more than one
vnode.


There are also errors in syslog:
Apr 13 12:11:55 hpcjr0007 pbs_mom: find_node, stream id 1 does not match 2 to node 1 (stream=192.168.0.226:15003 node=192.168.0.226:15003) 
Apr 13 12:11:56 hpcjr0007 pbs_mom: find_node, stream id 1 does not match 2 to node 1 (stream=192.168.0.226:15003 node=192.168.0.226:15003) 
Apr 13 12:11:56 hpcjr0007 pbs_mom: im_request, event 6 taskid 56 not found 
Apr 13 12:11:56 hpcjr0007 pbs_mom: im_request, job 56457.hpcjr-master.usc.edu: command 0 
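
To see how often the find_node mismatch shows up on a given MOM, something
like the following could be used.  This is only a sketch: the function name
is made up for illustration, and the pattern is derived from the message
text shown above, so it may need adjusting for other TORQUE versions.

```shell
# count_stream_mismatches: hypothetical helper, reads syslog text on stdin
# and prints how many lines contain the pbs_mom find_node stream-id
# mismatch message quoted above.
count_stream_mismatches() {
    grep -c 'pbs_mom: find_node, stream id [0-9]* does not match'
}
```

It would be fed from wherever your syslog ends up, for example
'count_stream_mismatches < /var/log/messages' (that path is an assumption
and depends on the local syslog configuration).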


$ cat $PBS_NODEFILE
hpcjr0007
hpcjr0006
$ pbsdsh pbsdsh pbsdsh pbsdsh pbsdsh bash -c 'echo $(hostname) $PBS_TASKNUM'
hpcjr0007 1014
hpcjr0006 1015
hpcjr0007 1016
hpcjr0007 1018
hpcjr0006 1017
hpcjr0006 1019
hpcjr0007 1028
hpcjr0006 1029
hpcjr0007 1030
hpcjr0007 1032
hpcjr0006 1031
hpcjr0007 1036
hpcjr0006 1033
hpcjr0007 1038
hpcjr0007 1040
hpcjr0006 1037
hpcjr0006 1039
hpcjr0006 1041
hpcjr0007 1044
hpcjr0006 1045
hpcjr0007 1046
hpcjr0007 1048
hpcjr0006 1047
hpcjr0006 1049
hpcjr0007 1050
hpcjr0006 1051
hpcjr0007 1052
hpcjr0007 1054
hpcjr0006 1053
hpcjr0006 1055
hpcjr0007 1056
hpcjr0006 1057
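
A quick way to sanity-check a listing like this, assuming each pbsdsh level
starts one task per vnode: a chain of d nested pbsdsh calls over n vnodes
should print n^d lines.  Here the nodefile has 2 entries and there are 5
pbsdsh calls, and the listing above has 2^5 = 32 lines.  The helper names
below are made up for illustration.

```shell
# expected_tasks VNODES DEPTH
# Prints VNODES raised to the power DEPTH, i.e. the number of output lines
# expected if every nested pbsdsh level fans out to every vnode (assumption).
expected_tasks() {
    vnodes=$1
    depth=$2
    total=1
    while [ "$depth" -gt 0 ]; do
        total=$((total * vnodes))
        depth=$((depth - 1))
    done
    echo "$total"
}

# tally_hosts
# Reads "hostname tasknum" lines on stdin and prints one "hostname count"
# line per host, sorted by hostname.
tally_hosts() {
    awk '{ count[$1]++ } END { for (h in count) print h, count[h] }' | sort
}
```

For example, piping the pbsdsh command above into tally_hosts and comparing
the summed counts against 'expected_tasks 2 5' shows at a glance whether
any tasks went missing.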

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California