[torqueusers] killed by signal 15

Garrick Staples garrick at clusterresources.com
Sat Sep 9 00:56:58 MDT 2006


On Sat, Sep 09, 2006 at 08:48:35AM +0200, Guillaume Alleon alleged:
> Usually I use torque to schedule my jobs and then use mpiexec to launch 
> my parallel code ;-)
> This time my code is written in Java and uses a Java MPI implementation, 
> so I parse the $PBS_NODEFILE on
> the mother node, start a server node on it, and "ssh" a java command 
> starting my process on all nodes in the
> nodefile. I have a script for doing this (attached at the end)
> 
> This works fine when using qsub -I ... but all the processes are killed 
> by a signal 15? Any thoughts about what's going on?
> 
> Here is my ugly script:
> VRAI=1
> NUM=0
> NP=`wc -l $PBS_NODEFILE | awk '{print $1}'`
> for i in `cat $PBS_NODEFILE`
> do
>  if [[ $VRAI = "1" && $HOSTNAME = $i ]]
>  then
>    echo "only on: $i ($VRAI)"
>    SERVEUR=$i
>    echo "the server is on : $SERVEUR"
>    ibis-nameserver -poolserver -single&
>    VRAI=0
>  fi
>  echo "ssh $i ibis-run -nhosts $NP -hostno $NUM -ns $SERVEUR -ns-port 9826 RunHal &"
>  ssh $i ibis-run -nhosts $NP -hostno $NUM -ns $SERVEUR -ns-port 9826 RunHal &
>  NUM=`expr $NUM + 1`
> done

It looks like you are running everything in the background, so the job
script exits immediately, which causes the job to exit and its remaining
processes to be killed (signal 15 is SIGTERM).

Put a 'wait' at the bottom of the script.
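
As a rough sketch of what that looks like (not your actual script: the
'sleep' subshells, the LOG temp file and the DONE counter are made-up
stand-ins for the 'ssh $i ibis-run ... &' lines):

```shell
#!/bin/bash
# Hypothetical stand-in for the job script: each background subshell
# plays the role of one 'ssh $i ibis-run ... &' command.
LOG=$(mktemp)   # placeholder file used only to record which tasks finished

for n in 1 2 3
do
  ( sleep 1; echo "task $n done" >> "$LOG" ) &
done

# Without this line the script reaches its end right away, while the
# background tasks are still running, and the job is considered finished.
wait

# Count the tasks that completed before the script ended.
DONE=$(wc -l < "$LOG" | tr -d ' ')
echo "$DONE tasks finished"
rm -f "$LOG"
```

Note that a bare 'wait' waits for every background child of the shell,
which in your script would include the ibis-nameserver started with '&'.
If that process does not exit on its own, one option is to save each ssh
PID from $! and pass only those PIDs to 'wait'.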

This MPI implementation doesn't come with an 'mpirun' command?


