[torqueusers] PBS_NODEFILE issue
alap pandya
arrow1533 at gmail.com
Tue Apr 20 06:57:14 MDT 2010
Hi,
I am having trouble running a job on multiple nodes under Torque. Please
give me your suggestions.
Issue:
When I change "#PBS -l nodes=1:ppn=2" to "#PBS -l nodes=2:ppn=2" in the
script, PBS_NODEFILE is not created and the job cannot run.
Note: similar issues are mentioned at
http://www.clusterresources.com/pipermail/torqueusers/2006-October/004434.html
http://www.clusterresources.com/pipermail/torqueusers/2010-January/009890.html
Torque version: 2.4.6
1> Runs fine on a single node.
#!/bin/sh
#PBS -l nodes=1:ppn=2
echo "HOSTNAME : $HOSTNAME"
echo "PBS_NODEFILE = $PBS_NODEFILE"
cd /disk
#echo $PBS_NODEFILE > shreenivas
cat $PBS_NODEFILE > pbsnodes
mpirun --hostfile $PBS_NODEFILE ./job1_100
[root at cluster disk]# cat pbsnodes
cluster.hpc.org
cluster.hpc.org
The job runs fine with 2 processes on a single node.
2> Changed "#PBS -l nodes=1:ppn=2" to "#PBS -l nodes=2:ppn=2":
#!/bin/sh
#PBS -l nodes=2:ppn=2
echo "HOSTNAME : $HOSTNAME"
echo "PBS_NODEFILE = $PBS_NODEFILE"
cd /disk
cat $PBS_NODEFILE > pbsnodes
mpirun --hostfile $PBS_NODEFILE ./job1_100
[root at cluster disk]# cat pbsnodes
No file is created this time, which is strange. No MPI job is running on
either node (compute-0-5, cluster), as shown in the tracejob output
below.
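To narrow down whether the variable is unset or the file is simply missing
on the node that runs the script, a small check could go in front of the
mpirun line. This is only a minimal sketch; check_nodefile is a hypothetical
helper name, not something Torque provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): report whether the node file
# that the job script expects in $PBS_NODEFILE is actually usable.
check_nodefile() {
    # Case 1: the variable itself was never set for this job.
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set"
        return 1
    fi
    # Case 2: the variable is set, but the file is missing or unreadable here.
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE=$PBS_NODEFILE is not readable on $(hostname)"
        return 1
    fi
    # Otherwise print how many slots were assigned per host.
    sort "$PBS_NODEFILE" | uniq -c
    return 0
}
```

Calling "check_nodefile || exit 1" just before mpirun would make the job stop
with a message in its output file instead of starting mpirun without a host
list.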
tracejob output:
04/20/2010 18:04:14 S enqueuing into test, state 1 hop 1
04/20/2010 18:04:14 S Job Queued at request of root at cluster, owner =
                      root at cluster, job name = a.sh, queue = test
04/20/2010 18:04:14 S Job Run at request of root at cluster
04/20/2010 18:04:14 A queue=test
04/20/2010 18:04:14 A user=root group=root jobname=a.sh queue=test
                      ctime=1271766854
                      qtime=1271766854 etime=1271766854 start=1271766854
                      owner=root at cluster
                      exec_host=compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1
                      Resource_List.neednodes=2:ppn=2
                      Resource_List.nodect=2
                      Resource_List.nodes=2:ppn=2
                      Resource_List.walltime=01:00:00
This sequence repeats many times because no PBS_NODEFILE is created, so MPI
is not able to get the node list.
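For comparison, the exec_host value in the tracejob output can be turned
into the host list I would expect to see in PBS_NODEFILE, assuming the file
holds one hostname per allocated slot (as it did in the single-node case).
A rough sketch, where expected_nodes is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: convert an exec_host string such as
# "hostA/1+hostA/0+hostB/1" into one hostname per line.
expected_nodes() {
    # Split the entries on '+', then drop the trailing "/<slot index>".
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/[0-9]*$||'
}

expected_nodes "compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1"
```

For the job above this gives compute-0-5 twice followed by cluster.hpc.org
twice, which is the list mpirun would need.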
With regards,
Alap