[torqueusers] submitted jobs not running all nodes

Mark Moorcroft mntbighker at gmail.com
Wed Jun 19 17:17:31 MDT 2013


I am testing a beta of a torque/maui "roll" for the Rocks cluster
software. It is supposed to be torque 4.2.2, and I run it on CentOS 6.x.
I seem to be having the same issue: according to the maui logs, the job
is distributed to both of my test nodes, but all the work runs on one
node.


> I'm running a couple of clusters, one 64-node cluster and one 4-node
> cluster utilizing the default torque package for scheduling and
> everything else. When I try to submit a job that will utilize more than
> one node it appears that it will not use all of the nodes, but rather it
> stays on one node. When I run tracejob <job-id> or qstat -f <job-id> it
> shows that the nodes have been allocated to the job and everything
> appears to be fine. If I go to the nodes individually and run top or ps
> -ef the job will only appear on one node and use only the processors of
> that node.


Here is my standard test submit script:


#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2,walltime=8:00:00
#PBS -j oe
#PBS -N xhpl2node
#PBS -m e
#

echo "$PBS_O_WORKDIR"
cd "$PBS_O_WORKDIR" || exit 1
ln -fs HPL.dat2node HPL.dat

mpirun -v -bynode -np 4 ./xhpl


If I specify the hosts with the -H switch the work goes to the correct
nodes as expected. I suspect this means it's a maui issue and not a
torque issue, but I was hoping someone has some ideas. I am not on a
maui list yet.
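
One way to narrow this down is to compare what torque allocated with
where the MPI ranks actually start. The sketch below assumes the job
environment provides $PBS_NODEFILE, which torque sets to a file holding
one hostname per allocated processor slot. The helper name
summarize_nodefile is hypothetical, and the commented mpirun line
assumes an MPI whose mpirun accepts -machinefile. If the summary lists
both nodes but the ranks still land on one, the allocation is probably
fine and the MPI launcher is not picking it up.

```shell
#!/bin/bash
# Hypothetical helper (not part of torque or maui): count how many
# processor slots each host has in a nodefile.
summarize_nodefile() {
    # sort groups identical hostnames; uniq -c counts each group;
    # awk reorders "count host" into "host: count slot(s)".
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Inside a job script: show what torque handed to the job.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Allocated by torque:"
    summarize_nodefile "$PBS_NODEFILE"
    # Assumption: this mpirun accepts -machinefile. Running hostname
    # instead of the real binary shows where each rank starts.
    # mpirun -np 4 -machinefile "$PBS_NODEFILE" hostname
fi
```

With nodes=2:ppn=2, the summary should show two hosts with two slots
each.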


Thanks in advance


p.s. Is there a searchable version of this list archive anywhere?

