[torqueusers] submitted jobs not running all nodes

Mark Moorcroft Mark.W.Moorcroft at nasa.gov
Wed Jun 19 12:28:49 MDT 2013


I am testing a beta of a torque/maui "roll" for the Rocks clustering 
software. It is supposed to be torque 4.2.2, and I am running it on 
CentOS 6.x. I seem to be having the same issue as described in the 
message quoted below. According to the maui logs, the job is distributed 
to both of my test nodes, but all the work runs on one node.


> I'm running a couple of clusters, one 64 node cluster and one 4 node
> cluster utilizing the default torque package for scheduling and
> everything else. When I try to submit a job that will utilize more than
> one node it appears that it will not use all of the nodes, but rather it
> stays on one node. When I run tracejob <job-id> or qstat -f <job-id> it
> shows that the nodes have been allocated to the job and everything
> appears to be fine. If I go to the nodes individually and run top or ps
> -ef the job will only appear on one node and use only the processors of
> that node.


Here is my standard test submit script:


#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2,walltime=8:00:00
#PBS -j oe
#PBS -N xhpl2node
#PBS -m e
#

echo $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
ln -fs HPL.dat2node HPL.dat

mpirun -v -bynode -np 4 ./xhpl


If I specify the hosts with the -H switch the work goes to the correct 
nodes as expected. I suspect this means it's a maui issue and not a 
torque issue, but I was hoping someone has some ideas. I am not on a 
maui list yet.
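
One check that might help narrow this down is to look at what torque 
actually hands the job in $PBS_NODEFILE and to pass that file to mpirun 
explicitly. The lines below are only a sketch for the job script. They 
assume that mpirun here is Open MPI (the -bynode and -H switches suggest 
that) and that it accepts -machinefile. The helper name count_slots is 
made up for illustration and is not part of torque or maui.

```shell
# Hypothetical helper: count how many slots each host has in a nodefile.
# Torque writes one hostname per allocated slot into $PBS_NODEFILE.
count_slots() {
    # $1: path to a nodefile
    sort "$1" | uniq -c | awk '{print $2 ": " $1}'
}

# Only act when running inside a job, where torque sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Allocated slots per host:"
    count_slots "$PBS_NODEFILE"

    # Assumption: Open MPI mpirun. Naming the hosts explicitly sidesteps
    # any dependence on mpirun discovering the torque allocation itself.
    mpirun -v -bynode -machinefile "$PBS_NODEFILE" -np 4 ./xhpl
fi
```

If the listing shows both nodes with two slots each and the run then 
spreads across them, the allocation itself is probably fine and the 
problem would more likely be in how mpirun learns about it.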


Thanks in advance


p.s. Is there a searchable version of this list archive anywhere?

