[torqueusers] PBS_NODEFILE issue

alap pandya arrow1533 at gmail.com
Tue Apr 20 06:57:14 MDT 2010


Hi,

I am facing an issue while running a job on multiple nodes with Torque.
Please give me your suggestions.


Issue:
When I change "#PBS -l nodes=1:ppn=2" to "#PBS -l nodes=2:ppn=2" in the
script, PBS_NODEFILE is not created and the job cannot run.

Note: similar issues are mentioned at
http://www.clusterresources.com/pipermail/torqueusers/2006-October/004434.html
http://www.clusterresources.com/pipermail/torqueusers/2010-January/009890.html

Torque: 2.4.6

1> Running fine with a single node.

#!/bin/sh
#PBS -l nodes=1:ppn=2
echo "HOSTNAME : $HOSTNAME"
echo "PBS_NODEFILE = $PBS_NODEFILE"
cd /disk
#echo $PBS_NODEFILE > shreenivas
cat $PBS_NODEFILE > pbsnodes
mpirun --hostfile $PBS_NODEFILE ./job1_100


[root@cluster disk]# cat pbsnodes
cluster.hpc.org
cluster.hpc.org

The job runs fine with 2 processes on a single node.

2> Changed "#PBS -l nodes=1:ppn=2" to "#PBS -l nodes=2:ppn=2":

#!/bin/sh
#PBS -l nodes=2:ppn=2
echo "HOSTNAME : $HOSTNAME"
echo "PBS_NODEFILE = $PBS_NODEFILE"
cd /disk
cat $PBS_NODEFILE > pbsnodes
mpirun --hostfile $PBS_NODEFILE ./job1_100

[root@cluster disk]# cat pbsnodes

No file is created this time, which is strange. No MPI job is running on
either node (compute-0-5, cluster), as shown in the tracejob output below.

tracejob output:

04/20/2010 18:04:14  S    enqueuing into test, state 1 hop 1
04/20/2010 18:04:14  S    Job Queued at request of root@cluster, owner = root@cluster, job name = a.sh, queue = test
04/20/2010 18:04:14  S    Job Run at request of root@cluster
04/20/2010 18:04:14  A    queue=test
04/20/2010 18:04:14  A    user=root group=root jobname=a.sh queue=test ctime=1271766854
                          qtime=1271766854 etime=1271766854 start=1271766854 owner=root@cluster
                          exec_host=compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1
                          Resource_List.neednodes=2:ppn=2 Resource_List.nodect=2
                          Resource_List.nodes=2:ppn=2 Resource_List.walltime=01:00:00
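
The exec_host value in the tracejob output already shows what the scheduler allocated, so it can be compared with what the job script sees. The following is only a rough sketch: the function name exec_host_slots is made up for illustration, and it assumes every exec_host entry has the form host/index joined by "+", as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): summarize an exec_host string
# into one line per host, prefixed by the number of slots on that host.
exec_host_slots() {
    # $1: exec_host string, e.g. "hostA/1+hostA/0+hostB/1+hostB/0"
    printf '%s\n' "$1" |
        tr '+' '\n' |      # one host/index entry per line
        cut -d/ -f1 |      # drop the slot index, keep the host name
        sort | uniq -c     # count slots per host
}
```

Calling exec_host_slots with the exec_host string from the tracejob output reports two slots on compute-0-5 and two on cluster.hpc.org, which matches the nodes=2:ppn=2 request even though the script never received a node file.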

The tracejob sequence above repeats many times because no PBS_NODEFILE is
created, so MPI is not able to get the node list.
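
To make this failure visible in the job's own output rather than only in tracejob, the script could check the node file before calling mpirun. This is a minimal sketch and not a fix for the underlying problem; the function name check_nodefile is hypothetical, and it assumes the node file holds one host name per allocated slot, as in the single-node pbsnodes listing above.

```shell
#!/bin/sh
# Hypothetical guard (not part of Torque): fail early when the node file
# is unset, missing, or empty, instead of letting mpirun start without hosts.
check_nodefile() {
    # $1: path to the node file (normally "$PBS_NODEFILE")
    if [ -z "$1" ]; then
        echo "PBS_NODEFILE is not set" >&2
        return 1
    fi
    if [ ! -s "$1" ]; then
        echo "node file '$1' is missing or empty" >&2
        return 1
    fi
    # One line per allocated slot; print each host with its slot count.
    sort "$1" | uniq -c
}
```

In the job script this would go just before the mpirun line, for example as: check_nodefile "$PBS_NODEFILE" || exit 1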


With regards,
Alap