[torqueusers] problem about running jobs on multiple nodes

rcbord at wm.edu rcbord at wm.edu
Mon Nov 12 12:08:40 MST 2007


Hi,
   I can't tell whether he is running MPI correctly or not, but I have hit
this same problem twice in the last month.  The last time, after I
modified the /var/spool/pbs/server_priv/nodes file, I restarted
pbs_server and pbs_sched.  I could see all of my changes via the
pbsnodes -a command and they were correct, but I could not run on more
than two processors (a single node).  The PBS_NODEFILE output from my
batch script listed only two processors.  'qstat -f' showed the same,
but you have to be quick to catch that because the jobs die
immediately.  After several hours and many iterations of stopping and
restarting pbs_server, pbs_sched and pbs_mom, I deleted all the queues
and went back to the very simplest queue from the torque setup.  I
restarted pbs_server and pbs_sched again and was able to run again.
I then deleted that queue and re-created all the original queues with
'qmgr < queue_file'.  After that I was able to run a set of 24-process
test jobs on the original queues.
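
For anyone who wants to make the same check from inside a batch script,
something like the following may help.  It is only a sketch: the function
name summarize_nodefile is made up for illustration, and it assumes the
usual layout of $PBS_NODEFILE, with one hostname per line for each
allocated processor.

```shell
# summarize_nodefile (hypothetical helper): print the number of slots per
# host in a PBS-style nodefile, followed by a one-line total.
summarize_nodefile() {
    nodefile="$1"
    # One line per allocated processor, so repeated names are slots per host.
    sort "$nodefile" | uniq -c | awk '{ printf "%s slots on %s\n", $1, $2 }'
    hosts=$(sort -u "$nodefile" | wc -l | tr -d ' ')
    slots=$(wc -l < "$nodefile" | tr -d ' ')
    echo "total: $slots slots on $hosts hosts"
}

# Inside a job: only report when Torque has actually set PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

With the failure described above, a nodes=2:ppn=2 request would be expected
to report a single host here rather than two.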

   We have 12 nodes with SLES10/ofed-1.2/mvapich/mpiexec (from OSC) and I
had been running several different codes on our cluster uninterrupted for
nearly a month.  I made the change to the nodes file and it just quit
working.  We are using Torque-2.1.9 with the default scheduler, pbs_sched.
So something in torque is causing this problem, with exec_host
not being allocated the node resources properly.
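
Because the jobs die so quickly, it can be easier to copy the exec_host
value out of 'qstat -f' and inspect it afterwards.  The helper below is a
rough sketch, not anything shipped with Torque: count_exec_hosts is a
made-up name, and it assumes the exec_host value has the form
host/slot+host/slot, which is how I understand Torque 2.x prints it.

```shell
# count_exec_hosts (hypothetical helper): given an exec_host value such as
# "node01/1+node01/0", print the number of distinct hosts it names.
count_exec_hosts() {
    printf '%s\n' "$1" |
        tr '+' '\n' |        # one host/slot entry per line
        sed 's,/.*$,,' |     # drop the "/slot" suffix
        sort -u |
        grep -c .            # count the remaining non-empty host names
}
```

For example, count_exec_hosts "node01/1+node01/0" prints 1, which is the
symptom here: two processors, but both on the same node (node01 is just a
placeholder host name).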

It may very well be the default pbs_sched, and once we
go to Maui that may resolve it.  Right now, though, I have not seen anything
on the mailing list about a solution or a workaround.  In the meantime
I would recommend deleting the queues, adding the very simplest batch
queue, and then restarting pbs_server and pbs_sched.  That seemed to
correct the exec_host problem.
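
As a rough illustration of what "the very simplest batch queue" could look
like, the sketch below writes a small qmgr directive file.  The helper name
write_minimal_queue and the file name minimal_queue are made up, and the
queue settings are my assumption of a bare-minimum execution queue, not a
copy of what the torque setup script creates.

```shell
# write_minimal_queue (hypothetical helper): write qmgr directives for one
# bare execution queue named "batch" to the file given as $1.
write_minimal_queue() {
    cat > "$1" <<'EOF'
create queue batch
set queue batch queue_type = Execution
set queue batch enabled = True
set queue batch started = True
set server default_queue = batch
EOF
}

# Possible use on the server host, as root (nothing below is run here):
#   qmgr -c "print server" > queue_file   # one way to save the current setup
#   write_minimal_queue minimal_queue
#   ...delete the existing queues, then:
#   qmgr < minimal_queue
#   ...restart pbs_server and pbs_sched, then test a multi-node job
```

Once a multi-node job runs on the simple queue, the original queues can be
restored with 'qmgr < queue_file'.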



Chris Bording
Application Analyst
High Performance Computing Group 
Information Technology
The College of William and Mary
(757)-221-3488
rcbord at wm.edu

On Mon, 12 Nov 2007, Garrick Staples wrote:

> On Mon, Nov 12, 2007 at 10:30:25AM +0800, Chien-Pin Chou alleged:
>> Hello:
>>
>> I have a problem about running jobs on multiple nodes (n>1)
>>
>> when I use qsub -l nodes=2:ppn=2 for testing,
>> but it just select 2 cpus in one node instead of choosing 2 cpus per node,
>> which is total 4 cpus to run
>>
>> my test script is :
>> #=========================
>> cd $PBS_O_WORKDIR
>> NPROCS=`wc -l < $PBS_NODEFILE`
>> echo $NPROCS
>> cat $PBS_NODEFILE
>> echo "...."
>> /opt/openmpi/bin/mpirun -np $NPROCS -machinefile $PBS_NODEFILE hostname
>
> You are running mpirun incorrectly.
>
> http://www.open-mpi.org/faq/?category=tm
>
>

