[torqueusers] submitted jobs not running all nodes
Mark.W.Moorcroft at nasa.gov
Wed Jun 19 12:28:49 MDT 2013
I am testing a beta of a torque/maui "roll" for Rocks clustering
software. This is supposed to be torque 4.2.2, and I am running it on
CentOS 6.x. I seem to be having the same issue: according to the maui
logs, the job is distributed to both of my test nodes, but all of the
work runs on one node.
> I'm running a couple of clusters, one 64-node cluster and one 4-node
> cluster, using the default torque package for scheduling and
> everything else. When I submit a job that should use more than one
> node, it does not use all of the nodes but stays on one node. When I
> run tracejob <job-id> or qstat -f <job-id>, it shows that the nodes
> have been allocated to the job and everything appears to be fine. If
> I go to the nodes individually and run top or ps -ef, the job appears
> on only one node and uses only the processors of that node.
Here is my standard test submit script:
#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2,walltime=8:00:00
#PBS -j oe
#PBS -N xhpl2node
#PBS -m e
ln -fs HPL.dat2node HPL.dat
mpirun -v -bynode -np 4 ./xhpl
If I specify the hosts with mpirun's -H switch, the work goes to the
correct nodes as expected. I suspect this means it's a maui issue and
not a torque issue, but I was hoping someone has some ideas. I am not
on a maui list yet.
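
One way to narrow this down is to print the allocation from inside the job and pass it to mpirun explicitly. The following is only a sketch under assumptions, not a confirmed fix: it assumes the job runs under Torque (which sets PBS_NODEFILE, one line per allocated processor slot) and that mpirun accepts -machinefile, as Open MPI and MPICH do. The summarize_nodefile helper is a hypothetical name introduced here for illustration.

```shell
#!/bin/bash
# Diagnostic sketch: show what Torque allocated, then launch on exactly that.

# Print "<slots> <hostname>" for each distinct host in a node file.
# (Hypothetical helper, not part of torque or maui.)
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{print $1, $2}'
}

# Only act when running inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Slots per host allocated by Torque:"
    summarize_nodefile "$PBS_NODEFILE"

    # Count non-empty lines: one per allocated processor slot.
    nslots=$(grep -c . "$PBS_NODEFILE")

    # Hand the allocation to mpirun directly instead of relying on
    # mpirun to detect it from the batch system.
    mpirun -machinefile "$PBS_NODEFILE" -np "$nslots" ./xhpl
fi
```

If the summary lists both nodes and the explicit -machinefile run spreads across them, the scheduler side is probably fine and the mpirun build is the more likely suspect. With Open MPI, for example, mpirun only picks up a Torque allocation on its own when it was built with Torque (tm) support; whether that applies to this roll is an assumption that would need checking.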
Thanks in advance
p.s. Is there a searchable version of this list archive anywhere?