[torqueusers] Issue running jobs on multiple nodes on Cluster with Torque 2.3.3

Ben Joseph Ben.Joseph at utas.edu.au
Sun Nov 16 17:12:48 MST 2008


Hi all,
I have a weird issue with our cluster.

I can run MPI jobs manually with MPICH without any problem:

/apps/mpich/1.2.7p1/bin/mpirun -machinefile machinefile.txt -np 16 /scratch/bjoseph/mpi_test/cpi

That runs fine: cpi gives me the output I was expecting, and I can see all the processes executing on the requested nodes.

However, I can't run the same command from a qsub script, and I can't run an interactive job across multiple nodes.
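
For reference, a minimal sketch of the kind of qsub script in question. The mpirun and cpi paths are the ones from the manual run above; the job name, the nodes=2:ppn=8 request and the count_slots helper are illustrative assumptions, not the exact script used on the cluster. It relies on Torque setting $PBS_NODEFILE to a file with one line per allocated slot.

```shell
#!/bin/sh
#PBS -N cpi_test
#PBS -l nodes=2:ppn=8
#PBS -j oe

# Paths taken from the manual run above.
MPIRUN=/apps/mpich/1.2.7p1/bin/mpirun
PROG=/scratch/bjoseph/mpi_test/cpi

# Hypothetical helper: count the slots Torque allocated.
# $PBS_NODEFILE lists one host name per slot, so count non-empty lines.
count_slots() {
    grep -c . "$1"
}

# Only launch inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Running on $NP slots:"
    sort "$PBS_NODEFILE" | uniq -c
    "$MPIRUN" -machinefile "$PBS_NODEFILE" -np "$NP" "$PROG"
fi
```

Submitted with a plain "qsub script.sh", a request of nodes=2:ppn=8 should give 16 slots, matching the -np 16 of the manual run.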

qsub -I -l nodes=1:ppn=6 runs fine.

qsub -I -l nodes=2:ppn=6 hangs. The job shows as running in the queue, but I never get a shell. I can't find any useful information in any of the logs either.

With PBSDEBUG=yes this is the only output I get:
bjoseph@r1lead:~> qsub -I -l nodes=2:ppn=8
xauth_path=/usr/X11R6/bin/xauth
pbs_connect using default server name list "r1lead"
pbs_connect attempting connection to server "r1lead"
pbs_connect: Successful connection to server "r1lead", fd = 1
qsub: waiting for job 1996.r1lead.ice.ice.internal to start

I get the same behaviour when running the MPI job over multiple nodes. The job submits and looks like it's running, but when I check the nodes it is supposedly running on, there are no processes there.
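
A rough sketch of the kind of per-node check meant here, assuming passwordless ssh to the compute nodes and a machinefile with one plain host name per line (repeated once per slot). The function names and the REMOTE_SH override are hypothetical, introduced only for this illustration.

```shell
#!/bin/sh
# Print each distinct host name from a machinefile
# (assumed format: one host name per line, repeated once per slot).
unique_hosts() {
    sort -u "$1"
}

# List the current user's processes on every host in the machinefile.
# REMOTE_SH is a hypothetical override for the remote shell (default: ssh).
check_nodes() {
    for host in $(unique_hosts "$1"); do
        echo "== $host =="
        ${REMOTE_SH:-ssh} "$host" ps -u "$(id -un)" -o pid,args
    done
}
```

For example, "check_nodes machinefile.txt" while the job is listed as running would show whether any cpi processes actually exist on each node.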

I can't find anything logged anywhere, and I'm pulling my hair out trying to fix it!

Any help would be greatly appreciated.

Regards,
Ben.
--
Ben Joseph
HPC Administrator
Information Technology Resources/ TPAC
www.tpac.org.au
Ben.Joseph at utas.edu.au
Ph: (03) 6226 6217

That's what's cool about working with computers. They don't argue, they remember everything and they don't drink all your beer.
