[torqueusers] Re: Newbie torque script questions

Jerry Smith jdsmit at sandia.gov
Thu Dec 7 09:15:00 MST 2006


Dave,

Try this in your pbs_script:

-l nodes=n10:ppn=4+n11:ppn=4+n12:ppn=4+n13:ppn=4

Make sure your $PBS_HOME/server_priv/nodes file looks like this:

n10 np=4
n11 np=4
..
..
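
In context, the top of the job script might look something like this (a rough
sketch; the job name, walltime, and trailing commands are just placeholders):

#!/bin/bash
#PBS -N cpi_test
#PBS -l nodes=n10:ppn=4+n11:ppn=4+n12:ppn=4+n13:ppn=4
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
# Torque builds $PBS_NODEFILE from the -l nodes request above;
# hand that file to mpirun rather than a hand-made node list.
cat $PBS_NODEFILE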


Just a follow-up: are you trying to get 4 nodes with 4 processors each, and use
only 1 processor per node?  Your original mpirun line asks for only 4
processors to run on (which n10 alone has).

If you want to use all the processors on all 4 nodes, you would want to use
-np 16.

-nolocal tells mpirun not to start processes on the node running the
controlling pbs_mom (n10 in this scenario), so you would really only be using
12 of the 16 processors.
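
Putting those two together, an mpirun line along these lines should put all 16
slots to work (a sketch built from your original command, with -nolocal
dropped and the Torque-generated node file used as the machinefile):

/usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -np 16 \
    -machinefile $PBS_NODEFILE \
    /usr/local/bin/mpich/p4-gnu/examples/cpi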

My other suggestion is to build Pete Wyckoff's mpiexec and use it in place of
mpirun, as it has many advantages (simpler usage, different flags, tight
integration with the Torque job-spawn mechanism, etc.):
http://www.osc.edu/~pw/mpiexec/index.php
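
Once mpiexec is built against Torque's tm interface it picks up the node
allocation from the job itself, so no machinefile is needed.  A rough sketch,
assuming mpiexec lands in your PATH:

# runs one process per allocated slot by default
mpiexec /usr/local/bin/mpich/p4-gnu/examples/cpi
# or pin the process count explicitly
mpiexec -n 16 /usr/local/bin/mpich/p4-gnu/examples/cpi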



Jerry Smith
-----------------------------------
Sandia National Labs
Infrastructure Computing Systems



From: dave first <linux4dave at gmail.com>
Date: Wed, 6 Dec 2006 09:32:47 -0800
To: <torqueusers at supercluster.org>
Subject: [torqueusers] Re: Newbie torque script questions

New data point - I ran the job with a 2-minute sleep and found the job running
only on n04, as qstat -f said it would be.

Why wouldn't qsub honor my local node list?

dave

On 12/6/06, dave first <linux4dave at gmail.com> wrote:
> I am such a newbie that I squeak.  I hope this is the correct forum in which
> to ask this question.
> 
> I want to specify a nodelist other than that which would be $PBS_NODEFILE.  I
> want to specify n10, n11, n12 and n13, each with 4 processors.  The node list
> looks something like this:
> 
> n10:4
> n11:4
> n12:4
> n13:4
> 
> And it is called local_nodelist in the working directory.
> 
> The script sets PBS_NODEFILE=`pwd`/local_nodelist
> 
> Running qstat -f while the script is executing shows what seems to be an
> erroneous node list:
> 
> Job Id: 76.excalibur
>     Job_Name = pbs_mpich.
>     Job_Owner = joeb at excalibur.example.com
>     resources_used.cput = 00:00:00
>     resources_used.mem = 4296kb
>     resources_used.vmem = 175988kb
>     resources_used.walltime = 00:00:12
>     job_state = R
>     queue = default
>     server = excalibur.example.com
>     Checkpoint = u
>     ctime = Wed Dec  6 08:54:16 2006
>     Error_Path = excalibur.example.com:/home/joeb/pbs_mpich..e76
>     exec_host = n04/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Wed Dec  6 08:54:17 2006
>     Output_Path = excalibur.example.com:/home/joeb/pbs_mpich..o76
>     Priority = 0
>     qtime = Wed Dec  6 08:54:16 2006
>     Rerunable = True
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     session_id = 31725
>     Variable_List = PBS_O_HOME=/home/joeb,PBS_O_LANG=en_US.UTF-8,
>         PBS_O_LOGNAME=joeb,
>         PBS_O_PATH=/opt/torque/bin:/opt/bin:/opt/hdfview/bin:/opt/hdf/bin:
>         /opt/ncarg/bin:/opt/mpich/p4-gnu/bin:/opt/mpiexec//bin:
>         /usr/kerberos/bin:/opt/java/jdk1.5.0/bin:/usr/lib64/ccache/bin:
>         /usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:
>         /opt/java/jdk1.5.0/jre/bin:/opt/visit/bin:/home/joeb/bin:
>         /opt/mpich/p4-gnu/sbin,PBS_O_MAIL=/var/spool/mail/joeb,
>         PBS_O_SHELL=/bin/bash,PBS_O_HOST=excalibur.example.com,
>         PBS_O_WORKDIR=/home/joeb,PBS_O_QUEUE=default
>     comment = Job started on Wed Dec 06 at 08:54
>     etime = Wed Dec  6 08:54:16 2006
> ------------------------------------------------------------------------------
> 
> However, the script output looks like this:
> 
> Job ID: 76.excalibur.example.com
> Working directory is /home/joeb
> Running on host n04.example.com
> Time is Wed Dec 6 08:54:17 PST 2006
> Directory is /home/joeb
> The node file is /net/fs/home/joeb/local_nodefile
> This job runs on the following processors:
> n09.example.com:4 n10.example.com:4 n11.example.com:4 n12.example.com:4
> This job has allocated 4 nodes/processors.
> 
> /usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -nolocal -np 4 -machinefile
> /net/fs/home/joeb/local_nodefile /usr/local/bin/mpich/p4-gnu/examples/cpi
> 
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.003906
> ------------------------------------------------------------------------------
> 
> Can anyone explain why the output of qstat -f and the script echo statements
> differ, and how can I determine which is correct?  (Short of sleeping for a
> while while I look for all the processes?)
> 
> Thanks,
> dave
> 



_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
