[torqueusers] tm aware ssh

David Roman David.Roman at noveltis.fr
Fri Dec 14 09:13:26 MST 2012


Hello,

I use this script with mpirun of intel.
But it doesn't work for me.

#!/bin/bash
# $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
# $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $

usage="usage: $0 <node name> <command>"

# swallow -x, -n and -q (for Intel MPI)
while getopts "xqn" opt
do
    :
done
shift $((OPTIND-1))

if [ $# -lt 2 ]
then
    echo "$usage"
    exit 1
fi

node=$1

shift

exec pbsdsh -v -o -h "$node" "$@"
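To see what the wrapper actually does with an Intel MPI style call, here is a plain-bash sketch of its argument handling (just the getopts/shift logic, outside of PBS; the node name and command are made up for illustration):

```shell
# Simulate the wrapper's argument parsing: swallow leading -x/-q/-n
# flags, take the next word as the node name, pass the rest through.
set -- -q -x hpc-node11 ./wrf.exe arg1

# getopts stops at the first non-option word (the node name)
while getopts "xqn" opt; do :; done
shift $((OPTIND-1))

node=$1
shift

echo "node=$node cmd=$*"
# prints: node=hpc-node11 cmd=./wrf.exe arg1
```

Note that getopts only swallows flags that appear *before* the node name; anything mpirun puts after the hostname is passed through as part of the command.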



I open an interactive session 
qsub -I -l nodes=32

I launch my code with (I'm on node12):
mpirun  -bootstrap-exec pbsssh -genv I_MPI_FABRICS_LIST tmi ./wrf.exe
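(As an aside: the same bootstrap choice can also be made through Hydra's environment variables rather than the command-line flag. This is a sketch based on the documented I_MPI_HYDRA_BOOTSTRAP / I_MPI_HYDRA_BOOTSTRAP_EXEC variables, assuming pbsssh is on $PATH.)

```shell
# Equivalent to passing -bootstrap-exec pbsssh on the command line:
export I_MPI_HYDRA_BOOTSTRAP=rsh          # use an rsh/ssh-style launcher
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbsssh  # ...but exec the TM wrapper instead
mpirun -genv I_MPI_FABRICS_LIST tmi ./wrf.exe
```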


And I get these messages:
pbsdsh(): rescinfo from 0: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 16: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 17: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 18: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 19: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 20: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 21: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 22: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 23: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 24: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 25: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 26: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 27: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 28: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 29: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 30: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 31: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 1: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 2: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 3: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 4: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 5: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 6: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 7: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 8: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 9: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 10: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 11: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 12: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 13: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 14: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 15: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): spawned task 16
pbsdsh(): spawn event returned: 16 (1 spawns and 0 obits outstanding)
pbsdsh(): sending obit for task 3
pbsdsh(): Event poll failed, error TM_ENOTCONNECTED
 starting wrf task           13  of           32
 starting wrf task            1  of           32
 starting wrf task            8  of           32
 starting wrf task            0  of           32
 starting wrf task            9  of           32
 starting wrf task           27  of           32
 starting wrf task           16  of           32
 starting wrf task           10  of           32
 starting wrf task           25  of           32
 starting wrf task           29  of           32
 starting wrf task           30  of           32
 starting wrf task           31  of           32
 starting wrf task           21  of           32
 starting wrf task           17  of           32
 starting wrf task           24  of           32
 starting wrf task           18  of           32
 starting wrf task           22  of           32
 starting wrf task            5  of           32
 starting wrf task           15  of           32
 starting wrf task            6  of           32
 starting wrf task            7  of           32
 starting wrf task            2  of           32
 starting wrf task            3  of           32
 starting wrf task           11  of           32
 starting wrf task           14  of           32
 starting wrf task           20  of           32
 starting wrf task           19  of           32
 starting wrf task           12  of           32
 starting wrf task           26  of           32
 starting wrf task            4  of           32
 starting wrf task           28  of           32
 starting wrf task           23  of           32
pbsdsh(): reconnected
pbsdsh(): skipping obit resend for 0
    [the line above repeats 16 times]
pbsdsh(): sending obit for task 3
pbsdsh(): skipping obit resend for 0
    [repeats 15 times]
pbsdsh(): Event poll failed, error TM_ENOTCONNECTED
pbsdsh(): reconnected
pbsdsh(): skipping obit resend for 0
    [repeats 16 times]
pbsdsh(): sending obit for task 3
pbsdsh(): skipping obit resend for 0
    [repeats 15 times]
... and so on.


Can somebody tell me why?


Thank you, everybody.


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On behalf of Brock Palen
Sent: Friday, December 14, 2012 16:03
To: Torque Users Mailing List
Subject: Re: [torqueusers] tm aware ssh

Thanks Gareth,


Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985



On Dec 13, 2012, at 6:18 PM, <Gareth.Williams at csiro.au> <Gareth.Williams at csiro.au> wrote:

>> I have a few applications that spawn using ssh and don't support tm,
>> 
>> There was once on the list a 'pbsssh'  that wrapped pbsdsh to act 
>> like ssh,
>> 
>> It looks like that script no longer works and I am scratching my head 
>> as to getting it working again  (my bash fu is weak).
>> 
>> Does anyone already have a way they do this?
>> 
>> The hope is to get correct process reporting, and cleanup for 
>> applications that don't support TM but spawn with ssh/rsh.
>> 
>> 
>> Thanks!
>> 
>> Brock Palen
> 
> Hi Brock,
> 
> We have this (below) in place.  You might need to swallow more ssh options if they are present, and there is no checking that "node" actually gets set to a cluster node name.
> Specifying a user would break this... but you would not expect that in a cluster inside a batch job, right?
> 
> Gareth
> 
> 
> wil240 at burnet-login:~> cat /apps/ascutils/bin/pbsssh
> #!/bin/bash
> # $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
> # $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
> 
> usage="usage: $0 <node name> <command>"
> 
> #swallow -x -n and -q (for intel mpi)
> while getopts "xqn" opt
> do
> :
> done
> shift $((OPTIND-1))
> 
> if [ $# -lt 2 ]
> then
>        echo $usage
>        exit
> fi
> 
> node=$1
> 
> shift
> 
> exec pbsdsh -o -h $node "$@"
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


