[torqueusers] Issue running jobs on multiple nodes on Cluster with
Ben.Joseph at utas.edu.au
Sun Nov 16 17:12:48 MST 2008
I have a weird issue with our cluster.
I can run MPI jobs manually with MPICH without any problem:
/apps/mpich/1.2.7p1/bin/mpirun -machinefile machinefile.txt -np 16 /scratch/bjoseph/mpi_test/cpi
That runs fine: the cpi command gives me the output I was expecting, and I can see all the processes executing on the requested nodes.
However, I can't run that from a qsub script, and I can't run an interactive job across multiple nodes.
qsub -I -l nodes=1:ppn=6 runs fine.
qsub -I -l nodes=2:ppn=6 hangs. The job looks like it's running in the queue, but I never get a shell. I can't find any useful information in any of the logs either.
With PBSDEBUG=yes this is the only output I get:
bjoseph@r1lead:~> qsub -I -l nodes=2:ppn=8
pbs_connect using default server name list "r1lead"
pbs_connect attempting connection to server "r1lead"
pbs_connect: Successful connection to server "r1lead", fd = 1
qsub: waiting for job 1996.r1lead.ice.ice.internal to start
I get the same kind of thing when running the MPI job over multiple nodes as well. The job submits and looks like it's running, but when you check the nodes it's supposed to be running on, there is nothing there.
I can't find anything logged anywhere, and I'm pulling my hair out trying to fix it!
Any help would be greatly appreciated.
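For reference, a minimal sketch of the kind of qsub script in question. The original script was not included in the post, so this is an assumption about its shape rather than the actual file: the job name cpi_test, the -j oe option, and the count_slots helper are illustrative only. It uses the machine file that Torque generates ($PBS_NODEFILE) instead of the hand-written machinefile.txt.

```shell
#!/bin/sh
#PBS -N cpi_test
#PBS -l nodes=2:ppn=8
#PBS -j oe

# Hypothetical helper: $PBS_NODEFILE lists one hostname per allocated
# processor, so its line count is the total number of MPI processes.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Only launch when running under Torque, which sets PBS_NODEFILE for the job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # Start from the directory the job was submitted from.
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    /apps/mpich/1.2.7p1/bin/mpirun -machinefile "$PBS_NODEFILE" -np "$NP" \
        /scratch/bjoseph/mpi_test/cpi
fi
```

Saved under a name such as cpi_test.sh (hypothetical), this would be submitted with qsub cpi_test.sh; with nodes=2:ppn=8 the node file should contain 16 lines, matching the -np 16 used in the manual run.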
Information Technology Resources/ TPAC
Ben.Joseph at utas.edu.au
Ph: (03) 6226 6217
That's what's cool about working with computers. They don't argue, they remember everything and they don't drink all your beer.