[torqueusers] tm aware ssh
David Roman
David.Roman at noveltis.fr
Fri Dec 14 09:13:26 MST 2012
Hello,
I am using the script below with Intel's mpirun,
but it does not work for me.
#!/bin/bash
# $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
# $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
usage="usage: $0 <node name> <command>"
# Swallow -x, -n and -q (for Intel MPI)
while getopts "xqn" opt
do
    :
done
shift $((OPTIND-1))
# Need at least a node name and a command
if [ $# -lt 2 ]
then
    echo "$usage"
    exit 1
fi
node=$1
shift
# -v asks pbsdsh for verbose output
exec pbsdsh -v -o -h "$node" "$@"
I open an interactive session:
qsub -I -l nodes=32
I launch my code with the following command (I am on node12):
mpirun -bootstrap-exec pbsssh -genv I_MPI_FABRICS_LIST tmi ./wrf.exe
and I get these messages:
pbsdsh(): rescinfo from 0: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 16: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 17: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 18: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 19: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 20: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 21: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 22: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 23: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 24: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 25: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 26: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 27: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 28: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 29: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 30: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 31: Linux hpc-node11 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 1: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 2: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 3: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 4: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 5: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 6: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 7: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 8: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 9: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 10: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 11: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 12: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 13: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 14: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): rescinfo from 15: Linux hpc-node12 2.6.32-lustre-1.8.5-2 #1 SMP Fri Oct 28 14:15:31 CEST 2011 x86_64:nodes=32
pbsdsh(): spawned task 16
pbsdsh(): spawn event returned: 16 (1 spawns and 0 obits outstanding)
pbsdsh(): sending obit for task 3
pbsdsh(): Event poll failed, error TM_ENOTCONNECTED
starting wrf task 13 of 32
starting wrf task 1 of 32
starting wrf task 8 of 32
starting wrf task 0 of 32
starting wrf task 9 of 32
starting wrf task 27 of 32
starting wrf task 16 of 32
starting wrf task 10 of 32
starting wrf task 25 of 32
starting wrf task 29 of 32
starting wrf task 30 of 32
starting wrf task 31 of 32
starting wrf task 21 of 32
starting wrf task 17 of 32
starting wrf task 24 of 32
starting wrf task 18 of 32
starting wrf task 22 of 32
starting wrf task 5 of 32
starting wrf task 15 of 32
starting wrf task 6 of 32
starting wrf task 7 of 32
starting wrf task 2 of 32
starting wrf task 3 of 32
starting wrf task 11 of 32
starting wrf task 14 of 32
starting wrf task 20 of 32
starting wrf task 19 of 32
starting wrf task 12 of 32
starting wrf task 26 of 32
starting wrf task 4 of 32
starting wrf task 28 of 32
starting wrf task 23 of 32
pbsdsh(): reconnected
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): sending obit for task 3
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): Event poll failed, error TM_ENOTCONNECTED
pbsdsh(): reconnected
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): sending obit for task 3
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
pbsdsh(): skipping obit resend for 0
and so on.
Can somebody tell me why?
Thank you, everybody.
-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Brock Palen
Sent: Friday, December 14, 2012 16:03
To: Torque Users Mailing List
Subject: Re: [torqueusers] tm aware ssh
Thanks Gareth,
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985
On Dec 13, 2012, at 6:18 PM, <Gareth.Williams at csiro.au> wrote:
>> I have a few applications that spawn using ssh and don't support tm,
>>
>> There was once on the list a 'pbsssh' that wrapped pbsdsh to act
>> like ssh,
>>
>> It looks like that script no longer works and I am scratching my head
>> as to getting it working again (my bash fu is weak).
>>
>> Does anyone already have a way they do this?
>>
>> The hope is to get correct process reporting, and cleanup for
>> applications that don't support TM but spawn with ssh/rsh.
>>
>>
>> Thanks!
>>
>> Brock Palen
>
> Hi Brock,
>
> We have this (below) in place. You might need to swallow more ssh options if they are present - and there is no checking that "node" actually gets set to a cluster node name.
> Specifying a user would break this... but you would not expect that in a cluster inside a batch job, right?
>
> Gareth
>
>
> wil240 at burnet-login:~> cat /apps/ascutils/bin/pbsssh
> #!/bin/bash
> # $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
> # $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
>
> usage="usage: $0 <node name> <command>"
>
> #swallow -x -n and -q (for intel mpi)
> while getopts "xqn" opt
> do
>     :
> done
> shift $((OPTIND-1))
>
> if [ $# -lt 2 ]
> then
>     echo $usage
>     exit
> fi
>
> node=$1
>
> shift
>
> exec pbsdsh -o -h $node "$@"
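
Gareth's two caveats (more ssh options may need swallowing, and the node name is never checked) could be handled along these lines. This is only a sketch under stated assumptions: the helper names count_ssh_opts and node_in_job are made up for illustration, the extra options -o, -p and -l are ordinary ssh options that a launcher might pass (not something confirmed for Intel MPI), and the check assumes the names listed in $PBS_NODEFILE are spelled the same way the launcher spells them.

```bash
#!/bin/bash
# Hypothetical helpers for an ssh-like wrapper around pbsdsh (sketch only).

# count_ssh_opts: print how many leading arguments are ssh-style options
# (including option values), so the caller can shift them away.
# Assumed option set: flags x, q, n; options with a value o, p, l.
count_ssh_opts() {
    local opt
    OPTIND=1
    # Leading ":" keeps getopts quiet about options it does not know.
    while getopts ":xqno:p:l:" opt
    do
        :   # every option is simply discarded
    done
    echo $((OPTIND - 1))
}

# node_in_job: succeed only if $1 is a whole line of the node file $2
# (inside a TORQUE job that file would be "$PBS_NODEFILE").
node_in_job() {
    grep -qxF -- "$1" "$2"
}

# Possible use inside a wrapper such as pbsssh (not executed here):
#   shift "$(count_ssh_opts "$@")"
#   node_in_job "$1" "$PBS_NODEFILE" || { echo "$1 is not in this job" >&2; exit 1; }
#   node=$1; shift
#   exec pbsdsh -o -h "$node" "$@"
```

Note that a user given with -l would be silently dropped here, which matches the expectation that a job runs as a single user. If the launcher passes fully qualified host names while the node file holds short names (or the reverse), the check would reject valid nodes and would need adjusting.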
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers