[torqueusers] PBS_NODEFILE issue

Si Hammond simon.hammond at gmail.com
Tue Apr 20 12:04:54 MDT 2010


Hi,

We're running 2.4.7 and can cat $PBS_NODEFILE in both the -l nodes=2:ppn=2 and -l nodes=1:ppn=2 configurations (i.e. it works fine for me).
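As a sanity check, the nodefile is just one hostname per allocated slot, so you can count slots and distinct hosts directly. A rough sketch (using a throwaway path rather than the real $PBS_NODEFILE):

```shell
# Simulate a nodefile for a 1-node, ppn=2 allocation (hypothetical path)
printf 'cluster.hpc.org\ncluster.hpc.org\n' > /tmp/nodefile
wc -l < /tmp/nodefile              # total MPI slots
sort -u /tmp/nodefile | wc -l      # distinct hosts
```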

If you have built Open MPI with --with-tm then you shouldn't need to specify the node file at all, right? The runtime picks the allocation up from the PBS engine during execution.

Have you tried just a basic mpirun ./pingpong or something like that?
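For what it's worth, with a TM-enabled Open MPI build the submission script can be as simple as the sketch below (./pingpong is just a placeholder binary; this assumes the --with-tm build mentioned above):

```
#!/bin/sh
#PBS -l nodes=2:ppn=2
cd $PBS_O_WORKDIR
# With TM support, mpirun obtains the allocation from pbs_mom directly,
# so no --hostfile (and no -np) is needed
mpirun ./pingpong
```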




S.


On 20 Apr 2010, at 13:57, alap pandya wrote:

> Hi,
> 
> I am facing an issue while running a job across multiple nodes on Torque. Please give me your suggestions.
> 
> 
> Issue :
> When I change #PBS -l nodes=1:ppn=2 to #PBS -l nodes=2:ppn=2 in the script, the PBS_NODEFILE is not created, and the job ultimately fails to run.
> 
> Note: similar issues are mentioned at
>          http://www.clusterresources.com/pipermail/torqueusers/2006-October/004434.html
>          http://www.clusterresources.com/pipermail/torqueusers/2010-January/009890.html
> 
> 
> Torque : 2.4.6 
> 
> 1> Runs fine on a single node.
> 
> #!/bin/sh
> #PBS -l nodes=1:ppn=2
> echo "HOSTNAME : $HOSTNAME"
> echo "PBS_NODEFILE = $PBS_NODEFILE"
> cd /disk
> #echo $PBS_NODEFILE > shreenivas
> cat $PBS_NODEFILE > pbsnodes
> mpirun --hostfile $PBS_NODEFILE ./job1_100
> 
> 
> [root@cluster disk]# cat pbsnodes
> cluster.hpc.org
> cluster.hpc.org
> 
> job is running fine with 2 processes on single node.
> 
> 2> Changed #PBS -l nodes=1:ppn=2 to #PBS -l nodes=2:ppn=2:
> 
> #!/bin/sh
> #PBS -l nodes=2:ppn=2
> echo "HOSTNAME : $HOSTNAME"
> echo "PBS_NODEFILE = $PBS_NODEFILE"
> cd /disk
> cat $PBS_NODEFILE > pbsnodes
> mpirun --hostfile $PBS_NODEFILE ./job1_100
> 
> [root@cluster disk]# cat pbsnodes
> No file is created this time, which is strange; no MPI job runs on either node (compute-0-5, cluster), as shown in the tracejob output below.
> 
> tracejob output :
> 
> 04/20/2010 18:04:14  S    enqueuing into test, state 1 hop 1
> 04/20/2010 18:04:14  S    Job Queued at request of root at cluster, owner = root at cluster, job name
>                           = a.sh, queue = test
> 04/20/2010 18:04:14  S    Job Run at request of root at cluster
> 04/20/2010 18:04:14  A    queue=test
> 04/20/2010 18:04:14  A    user=root group=root jobname=a.sh queue=test ctime=1271766854
>                           qtime=1271766854 etime=1271766854 start=1271766854 owner=root at cluster
>                           exec_host=compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1
>                           Resource_List.neednodes=2:ppn=2 Resource_List.nodect=2
>                           Resource_List.nodes=2:ppn=2 Resource_List.walltime=01:00:00 
> 
> ... (this sequence repeats many times, since no PBS_NODEFILE is created and MPI cannot obtain the node list).
> 
> 
> With regards,
> Alap
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
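Incidentally, the exec_host string in that tracejob output tells you exactly which hosts and slots the nodefile should have contained, so the allocation itself looks fine. Splitting it makes that easy to see (a hedged sketch, with the value copied from the log above):

```shell
# exec_host value copied from the tracejob output above
exec_host='compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1'
# One entry per allocated slot; strip the /<cpu-index> suffix to get hostnames
echo "$exec_host" | tr '+' '\n' | cut -d/ -f1
```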


---------------------------------------------------------------------------------------
Si Hammond

Research & Knowledge Transfer Associate
Performance Modelling, Analysis and Optimisation Team
High Performance Systems Group
Department of Computer Science
University of Warwick, CV4 7AL, UK
http://go.warwick.ac.uk/hpsg
----------------------------------------------------------------------------------------


