[torqueusers] Re: Newbie torque script questions
Jerry Smith
jdsmit at sandia.gov
Thu Dec 7 09:15:00 MST 2006
Dave,
Try this in your pbs_script:
-l nodes=n10:ppn=4+n11:ppn=4+n12:ppn=4+n13:ppn=4
Make sure your $PBS_HOME/server_priv/nodes looks like
n10 np=4
n11 np=4
..
..
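As a rough sketch of how that request fits into a job script: the script below asks for the four named nodes and then reports what Torque actually handed out. It relies on $PBS_NODEFILE being set by Torque itself, with one line per allocated processor slot, so the script should read that variable rather than overwrite it. The job name and the helper functions count_slots and slots_per_host are made up for illustration.

```shell
#!/bin/bash
#PBS -N pbs_mpich
#PBS -l nodes=n10:ppn=4+n11:ppn=4+n12:ppn=4+n13:ppn=4

# Hypothetical helper: number of processor slots listed in a node file
# (Torque writes one line per allocated slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Hypothetical helper: how many slots each host contributes.
slots_per_host() {
    sort "$1" | uniq -c
}

# Only report when running inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "Node file: $PBS_NODEFILE"
    echo "Allocated slots: $(count_slots "$PBS_NODEFILE")"
    slots_per_host "$PBS_NODEFILE"
fi
```

With the request above, the report should show four hosts with four slots each if the allocation went through as asked.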
Just a follow-up: do you want 4 nodes with 4 processors each, but to use
only 1 processor per node? Your original mpirun line asks for only 4
processes (-np 4), which n10 alone can supply.
If you want to use all processors on all 4 nodes, you would want -np
16.
-nolocal assumes you do not want to run processes on the controlling pbs_mom
(n10 in this scenario), so you are really only getting 12 of the 16
processors.
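To illustrate the -np arithmetic, here is a small sketch that derives the process count from the node file instead of hard-coding it. It only prints the mpirun command so you can inspect it first. The function name mpirun_cmd and the program path ./cpi are placeholders, and -nolocal is left out so that the first node also runs processes.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an mpirun command that uses every
# slot listed in a node file.
#   $1 = node file (inside a job: "$PBS_NODEFILE")
#   $2 = MPI program to launch (placeholder path)
mpirun_cmd() {
    local nodefile=$1 program=$2
    local np
    # One line per processor slot, so the line count is the value for -np.
    np=$(wc -l < "$nodefile" | tr -d ' ')
    echo "mpirun -np $np -machinefile $nodefile $program"
}

# Inside a Torque job, show the command that would be launched.
if [ -n "${PBS_NODEFILE:-}" ]; then
    mpirun_cmd "$PBS_NODEFILE" ./cpi
fi
```

For 4 nodes with ppn=4 the node file has 16 lines, so the printed command carries -np 16.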
My other suggestion is to build Pete Wyckoff's mpiexec and use it in place
of mpirun, as it has many advantages (simpler usage, different flags, tight
integration with Torque job spawning, etc.)
http://www.osc.edu/~pw/mpiexec/index.php
Jerry Smith
-----------------------------------
Sandia National Labs
Infrastructure Computing Systems
From: dave first <linux4dave at gmail.com>
Date: Wed, 6 Dec 2006 09:32:47 -0800
To: <torqueusers at supercluster.org>
Subject: [torqueusers] Re: Newbie torque script questions
New datapoint - I ran the job with a 2 minute sleep, and found the job
running only on n04, as qstat -f said it would be.
Why wouldn't qsub honor my local node list?
dave
On 12/6/06, dave first <linux4dave at gmail.com> wrote:
> I am such a newbie that I squeak. I hope this is the correct forum in which
> to ask this question.
>
> I want to specify a nodelist other than that which would be $PBS_NODEFILE. I
> want to specify n10, n11, n12 and n13, each with 4 processors. The node list
> looks something like this:
>
> n10:4
> n11:4
> n12:4
> n13:4
>
> And it is called local_nodelist in the working directory.
>
> The script sets PBS_NODEFILE=`pwd`/local_nodelist
>
> qstat -f while running the script elicits what seems to be an erroneous
> nodelist
>
> Job Id: 76.excalibur
> Job_Name = pbs_mpich.
> Job_Owner = joeb at excalibur.example.com
> resources_used.cput = 00:00:00
> resources_used.mem = 4296kb
> resources_used.vmem = 175988kb
> resources_used.walltime = 00:00:12
> job_state = R
> queue = default
> server = excalibur.example.com
> Checkpoint = u
> ctime = Wed Dec 6 08:54:16 2006
> Error_Path = excalibur.example.com:/home/joeb/pbs_mpich..e76
> exec_host = n04/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Wed Dec 6 08:54:17 2006
> Output_Path = excalibur.example.com:/home/joeb/pbs_mpich..o76
> Priority = 0
> qtime = Wed Dec 6 08:54:16 2006
> Rerunable = True
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> session_id = 31725
> Variable_List = PBS_O_HOME=/home/joeb,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=joeb,
> PBS_O_PATH=/opt/torque/bin:/opt/bin:/opt/hdfview/bin:/opt/hdf/bin:/opt
> /ncarg/bin:/opt/mpich/p4-gnu/bin:/opt/mpiexec//bin:/usr/kerberos/bin:/o
> pt/java/jdk1.5.0/bin:/usr/lib64/ccache/bin:/usr/local/bin:/bin:/usr/bin
> :/usr/X11R6/bin:/opt/java/jdk1.5.0/jre/bin:/opt/visit/bin:/home/joeb/bi
> n:/opt/mpich/p4-gnu/sbin,PBS_O_MAIL=/var/spool/mail/joeb
> PBS_O_SHELL=/bin/bash,PBS_O_HOST=excalibur.example.com,
> PBS_O_WORKDIR=/home/joeb,PBS_O_QUEUE=default
> comment = Job started on Wed Dec 06 at 08:54
> etime = Wed Dec 6 08:54:16 2006
> ---------------------------------------------------------------------------------
>
> However, the script output looks like this:
>
> Job ID: 76.excalibur.example.com
> Working directory is /home/joeb
> Running on host n04.example.com
> Time is Wed Dec 6 08:54:17 PST 2006
> Directory is /home/joeb
> The node file is /net/fs/home/joeb/local_nodefile
> This job runs on the following processors:
> n09.example.com:4 n10.example.com:4 n11.example.com:4 n12.example.com:4
> This job has allocated 4 nodes/processors.
>
> /usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -nolocal -np 4 -machinefile /net/fs/home/joeb/local_nodefile /usr/local/bin/mpich/p4-gnu/examples/cpi
>
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.003906
> ---------------------------------------------------------------------------------
>
> Can anyone explain why the output of qstat -f and the script echo statements
> differ, and how can I determine which is correct? (Short of sleeping for a
> while while I look for all the processes?)
>
> Thanks,
> dave
>