[torqueusers] PBS_NODEFILE issue
Si Hammond
simon.hammond at gmail.com
Tue Apr 20 12:04:54 MDT 2010
Hi,
We're running 2.4.7 and I can cat the $PBS_NODEFILE with both -l nodes=2:ppn=2 and -l nodes=1:ppn=2 (i.e. it works fine for me).
If you have built Open MPI with --with-tm, then you shouldn't need to specify the node file, right? The runtime should pick this up from the PBS engine during execution.
Have you tried just a basic mpirun ./pingpong or something like that?
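Something along these lines is what I have in mind. It is only a rough sketch and untested on your setup: summarize_nodefile is a helper name I made up for illustration, and I am assuming Open MPI really was built with --with-tm so that mpirun needs no --hostfile. It checks that $PBS_NODEFILE exists and is non-empty before launching anything, so the job fails loudly if the node file is missing.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2

# Hypothetical helper: print "hosts=<unique hosts> slots=<total lines>"
# for a node file, or return non-zero if the file is missing or empty.
summarize_nodefile() {
    nodefile="$1"
    if [ -z "$nodefile" ] || [ ! -s "$nodefile" ]; then
        echo "node file missing or empty: '$nodefile'" >&2
        return 1
    fi
    slots=$(wc -l < "$nodefile" | tr -d ' ')
    hosts=$(sort -u "$nodefile" | wc -l | tr -d ' ')
    echo "hosts=$hosts slots=$slots"
}

# Only launch when actually running inside a PBS job.
if [ -n "$PBS_JOBID" ]; then
    summarize_nodefile "$PBS_NODEFILE" || exit 1
    # Assumption: a tm-enabled Open MPI build, so no --hostfile here.
    mpirun ./pingpong
fi
```

For nodes=2:ppn=2 I would expect the summary line to report two hosts and four slots; anything else suggests the problem is on the Torque side rather than in MPI.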
S.
On 20 Apr 2010, at 13:57, alap pandya wrote:
> Hi,
>
> I am facing an issue while running a job on multiple nodes with Torque. Please give me your suggestions.
>
>
> Issue:
> When I change #PBS -l nodes=1:ppn=2 to #PBS -l nodes=2:ppn=2 in the script, PBS_NODEFILE is not created and the job cannot run.
>
> Note: similar issues are mentioned at
> http://www.clusterresources.com/pipermail/torqueusers/2006-October/004434.html
> http://www.clusterresources.com/pipermail/torqueusers/2010-January/009890.html
>
>
> Torque : 2.4.6
>
> 1> Runs fine on a single node.
>
> #!/bin/sh
> #PBS -l nodes=1:ppn=2
> echo "HOSTNAME : $HOSTNAME"
> echo "PBS_NODEFILE = $PBS_NODEFILE"
> cd /disk
> #echo $PBS_NODEFILE > shreenivas
> cat $PBS_NODEFILE > pbsnodes
> mpirun --hostfile $PBS_NODEFILE ./job1_100
>
>
> [root at cluster disk]# cat pbsnodes
> cluster.hpc.org
> cluster.hpc.org
>
> The job runs fine with 2 processes on a single node.
>
> 2> Changed #PBS -l nodes=1:ppn=2 to #PBS -l nodes=2:ppn=2:
>
> #!/bin/sh
> #PBS -l nodes=2:ppn=2
> echo "HOSTNAME : $HOSTNAME"
> echo "PBS_NODEFILE = $PBS_NODEFILE"
> cd /disk
> cat $PBS_NODEFILE > pbsnodes
> mpirun --hostfile $PBS_NODEFILE ./job1_100
>
> [root at cluster disk]# cat pbsnodes
> No file is created this time, which is strange. No MPI job runs on either node (compute-0-5, cluster), as the tracejob output below shows.
>
> tracejob output :
>
> 04/20/2010 18:04:14 S enqueuing into test, state 1 hop 1
> 04/20/2010 18:04:14 S Job Queued at request of root at cluster, owner = root at cluster, job name
> = a.sh, queue = test
> 04/20/2010 18:04:14 S Job Run at request of root at cluster
> 04/20/2010 18:04:14 A queue=test
> 04/20/2010 18:04:14 A user=root group=root jobname=a.sh queue=test ctime=1271766854
> qtime=1271766854 etime=1271766854 start=1271766854 owner=root at cluster
> exec_host=compute-0-5/2+compute-0-5/1+cluster.hpc.org/2+cluster.hpc.org/1
> Resource_List.neednodes=2:ppn=2 Resource_List.nodect=2
> Resource_List.nodes=2:ppn=2 Resource_List.walltime=01:00:00
>
> ... This sequence repeats many times. Since no PBS_NODEFILE is created, MPI cannot get the node list.
>
>
> With regards,
> Alap
>
>
---------------------------------------------------------------------------------------
Si Hammond
Research & Knowledge Transfer Associate
Performance Modelling, Analysis and Optimisation Team
High Performance Systems Group
Department of Computer Science
University of Warwick, CV4 7AL, UK
http://go.warwick.ac.uk/hpsg
----------------------------------------------------------------------------------------