[torqueusers] problem about running jobs on multiple nodes
rcbord at wm.edu
Mon Nov 12 12:08:40 MST 2007
Hi,
I can't tell whether he is running MPI correctly or not, but I have had
this same problem twice in the last month. The last time, after I
modified the /var/spool/pbs/server_priv/nodes file, I restarted
pbs_server and pbs_sched. I could see all of my changes via the
pbsnodes -a command and they were correct, but I could not run on more
than two processors (a single node). The PBS_NODEFILE output from my
batch script listed only two processors. 'qstat -f' showed the
same, but you have to be quick to catch that because the jobs die
immediately. After several hours and many iterations of
stopping and restarting pbs_server, pbs_sched and pbs_mom, I deleted all
the queues and went back to the very simplest queue from the torque
setup. I restarted pbs_server and pbs_sched again and was able to run
again. I then deleted that queue and re-created all the original queues
with qmgr < queue_file. After that I was able to run a set of 24-process
test jobs on the original queues again.
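For anyone who wants to repeat the PBS_NODEFILE check from inside a batch script, here is a minimal sketch. The helper names summarize_nodefile and check_allocation are made up for this example and are not Torque commands; the only assumption is that Torque writes one line per allocated processor to the file named by PBS_NODEFILE.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a Torque node file.
# Assumption: the file has one hostname line per allocated processor.

# Print "slots=N hosts=M", then one "host count" line per host.
summarize_nodefile() {
    nodefile=$1
    if [ ! -r "$nodefile" ]; then
        echo "cannot read node file: $nodefile" >&2
        return 1
    fi
    slots=$(grep -c . "$nodefile")                      # non-blank lines
    hosts=$(grep . "$nodefile" | sort -u | wc -l | tr -d ' ')
    echo "slots=$slots hosts=$hosts"
    grep . "$nodefile" | sort | uniq -c | awk '{print $2, $1}'
}

# check_allocation NODEFILE NODES PPN
# Succeeds only if the file lists NODES distinct hosts and NODES*PPN lines.
check_allocation() {
    [ -r "$1" ] || return 1
    want_slots=$(( $2 * $3 ))
    got_slots=$(grep -c . "$1")
    got_hosts=$(grep . "$1" | sort -u | wc -l | tr -d ' ')
    [ "$got_hosts" -eq "$2" ] && [ "$got_slots" -eq "$want_slots" ]
}

# Inside a job, PBS_NODEFILE is set by Torque; outside a job this is a no-op.
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

In a job submitted with -l nodes=2:ppn=2, something like `check_allocation "$PBS_NODEFILE" 2 2 || echo "short allocation" >&2` near the top of the script leaves a trace in the job's error file even when the job dies too quickly to catch with 'qstat -f'.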
We have 12 nodes with SLES10/ofed-1.2/mvapich/mpiexec (from OSC) and I
had been running several different codes on our cluster uninterrupted for
nearly a month. I made the change to the nodes file and it just quit
working. We are using Torque-2.1.9 with the default scheduler, pbs_sched.
So something in Torque is causing this problem: the job's exec_host
is not being assigned the node resources properly.
It may very well be the default pbs_sched, and once we
go to Maui that may resolve it. So far I have not seen anything
on the mailing list regarding a solution or a workaround, but
I would recommend deleting the queues, adding the very simplest batch
queue, and then restarting pbs_server and pbs_sched. That seemed to
correct the exec_host problem.
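The workaround above can be sketched roughly as follows. This is only an outline under several assumptions: the queue names passed in are your own, queue_file and the "batch" queue name are placeholders, the queues hold no jobs when they are deleted, and the commands are run as a Torque manager on the server host. By default the script only prints what it would do; the run, reset_queues and restore_queues names are made up for this example.

```shell
#!/bin/sh
# Sketch of the queue-reset workaround. Prints commands by default;
# set DRY_RUN=0 to execute them for real.
DRY_RUN=${DRY_RUN:-1}
QUEUE_FILE=${QUEUE_FILE:-queue_file}     # placeholder backup file name
SIMPLE_QUEUE=${SIMPLE_QUEUE:-batch}      # placeholder simple queue name

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reset_queues QUEUE...: save the config, drop the listed queues,
# create one plain execution queue, then restart pbs_server.
reset_queues() {
    run sh -c "qmgr -c 'print server' > '$QUEUE_FILE'"
    for q in "$@"; do
        run qmgr -c "delete queue $q"
    done
    run qmgr -c "create queue $SIMPLE_QUEUE"
    run qmgr -c "set queue $SIMPLE_QUEUE queue_type = execution"
    run qmgr -c "set queue $SIMPLE_QUEUE enabled = true"
    run qmgr -c "set queue $SIMPLE_QUEUE started = true"
    run qmgr -c "set server default_queue = $SIMPLE_QUEUE"
    run qterm -t quick
    run pbs_server
    # pbs_sched must be stopped and started as well; how to stop it is
    # site-specific (init script, kill), so it is left out of this sketch.
}

# restore_queues: once a multi-node test job runs, drop the simple
# queue and replay the saved configuration.
restore_queues() {
    run qmgr -c "delete queue $SIMPLE_QUEUE"
    run sh -c "qmgr < '$QUEUE_FILE'"
}
```

Calling `reset_queues long short` (with your own queue names) and later `restore_queues` prints the command sequence for review. Note that the backup has to be taken before anything is deleted, which is why 'print server' comes first.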
Chris Bording
Application Analyst
High Performance Computing Group
Information Technology
The College of William and Mary
(757)-221-3488
rcbord at wm.edu
On Mon, 12 Nov 2007, Garrick Staples wrote:
> On Mon, Nov 12, 2007 at 10:30:25AM +0800, Chien-Pin Chou alleged:
>> Hello:
>>
>> I have a problem running jobs on multiple nodes (n>1).
>>
>> When I use qsub -l nodes=2:ppn=2 for testing,
>> it selects only 2 CPUs on one node instead of 2 CPUs on each of 2 nodes,
>> which would be 4 CPUs in total.
>>
>> My test script is:
>> #=========================
>> cd $PBS_O_WORKDIR
>> NPROCS=`wc -l < $PBS_NODEFILE`
>> echo $NPROCS
>> cat $PBS_NODEFILE
>> echo "...."
>> /opt/openmpi/bin/mpirun -np $NPROCS -machinefile $PBS_NODEFILE hostname
>
> You are running mpirun incorrectly.
>
> http://www.open-mpi.org/faq/?category=tm
>
>