[torqueusers] Re: Newbie torque script questions

dave first linux4dave at gmail.com
Wed Dec 6 10:32:47 MST 2006


New datapoint - I ran the job with a 2-minute sleep and found the job
running only on n04, as qstat -f said it would be.

Why wouldn't qsub honor my local node list?
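
A possible explanation, offered as a sketch rather than a confirmed
diagnosis: TORQUE chooses the execution hosts from the resource request
given to qsub (Resource_List.nodes = 1 in the qstat output quoted below),
and PBS_NODEFILE is set for the job only after that allocation is made.
Reassigning the variable inside the script therefore changes what mpirun
reads, not where the job was placed. Assuming the goal is to have TORQUE
itself allocate the hosts in local_nodelist, the request could be built
from that file. nodes_spec is a hypothetical helper name, and job.sh is a
hypothetical script name.

```shell
# Hypothetical helper: convert "host:count" lines (the local_nodelist
# format) into a TORQUE node request such as n10:ppn=4+n11:ppn=4.
# A line without ":count" is treated as one processor (an assumption).
nodes_spec() {
    awk -F: 'NF { printf "%s%s:ppn=%s", sep, $1, ($2 == "" ? 1 : $2); sep = "+" }
             END { print "" }' "$1"
}

# Possible usage (job.sh is a placeholder for the real script):
#   qsub -l nodes="$(nodes_spec local_nodelist)" job.sh
```

With a request like this, the file named by $PBS_NODEFILE inside the job
should list the requested hosts, so the script would not need to override
it.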

dave

On 12/6/06, dave first <linux4dave at gmail.com> wrote:
>
> I am such a newbie that I squeak.  I hope this is the correct forum in
> which to ask this question.
>
> I want to specify a nodelist other than that which would be
> $PBS_NODEFILE.  I want to specify n10, n11, n12 and n13, each with 4
> processors.  The node list looks something like this:
>
> n10:4
> n11:4
> n12:4
> n13:4
>
> And it is called local_nodelist in the working directory.
>
> The script sets PBS_NODEFILE=`pwd`/local_nodelist
>
> Running qstat -f while the script is running shows what seems to be an
> erroneous nodelist:
>
> Job Id: 76.excalibur
>     Job_Name = pbs_mpich.
>     Job_Owner = joeb at excalibur.example.com
>     resources_used.cput = 00:00:00
>     resources_used.mem = 4296kb
>     resources_used.vmem = 175988kb
>     resources_used.walltime = 00:00:12
>     job_state = R
>     queue = default
>     server = excalibur.example.com
>     Checkpoint = u
>     ctime = Wed Dec  6 08:54:16 2006
>     Error_Path = excalibur.example.com:/home/joeb/pbs_mpich..e76
>     exec_host = n04/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Wed Dec  6 08:54:17 2006
>     Output_Path = excalibur.example.com:/home/joeb/pbs_mpich..o76
>     Priority = 0
>     qtime = Wed Dec  6 08:54:16 2006
>     Rerunable = True
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     session_id = 31725
>     Variable_List = PBS_O_HOME=/home/joeb,PBS_O_LANG=en_US.UTF-8,
>         PBS_O_LOGNAME=joeb,
>         PBS_O_PATH=/opt/torque/bin:/opt/bin:/opt/hdfview/bin:/opt/hdf/bin:/opt
>         /ncarg/bin:/opt/mpich/p4-gnu/bin:/opt/mpiexec//bin:/usr/kerberos/bin:/o
>         pt/java/jdk1.5.0/bin:/usr/lib64/ccache/bin:/usr/local/bin:/bin:/usr/bin
>         :/usr/X11R6/bin:/opt/java/jdk1.5.0/jre/bin:/opt/visit/bin:/home/joeb/bi
>         n:/opt/mpich/p4-gnu/sbin,PBS_O_MAIL=/var/spool/mail/joeb
>         PBS_O_SHELL=/bin/bash,PBS_O_HOST=excalibur.example.com,
>         PBS_O_WORKDIR=/home/joeb,PBS_O_QUEUE=default
>     comment = Job started on Wed Dec 06 at 08:54
>     etime = Wed Dec  6 08:54:16 2006
> ---------------------------------------------------------------------------------
>
>
> However, the script output looks like this:
>
> Job ID: 76.excalibur.example.com
> Working directory is /home/joeb
> Running on host n04.example.com
> Time is Wed Dec 6 08:54:17 PST 2006
> Directory is /home/joeb
> The node file is /net/fs/home/joeb/local_nodefile
> This job runs on the following processors:
> n09.example.com:4 n10.example.com:4 n11.example.com:4 n12.example.com:4
> This job has allocated 4 nodes/processors.
>
> /usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -nolocal -np 4 -machinefile
> /net/fs/home/joeb/local_nodefile /usr/local/bin/mpich/p4-gnu/examples/cpi
>
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.003906
> ---------------------------------------------------------------------------------
>
>
> Can anyone explain why the output of qstat -f and the script echo
> statements differ, and how can I determine which is correct?  (Short of
> sleeping for a while while I look for all the processes?)
>
> Thanks,
> dave
>
>
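
One way to compare the two lists without sleeping is to reduce the
exec_host value from qstat -f to plain host names and set it beside the
hosts in the machinefile. The sketch below assumes exec_host entries have
the form host/slot joined by "+", as in the n04/0 value quoted above;
exec_hosts is a hypothetical helper name.

```shell
# Hypothetical helper: reduce an exec_host value such as
# "n04/0+n04/1+n05/0" to its unique host names, one per line.
exec_hosts() {
    printf '%s\n' "$1" | tr '+' '\n' | cut -d/ -f1 | sort -u
}

# Example with the value reported for job 76:
exec_hosts "n04/0"
```

If the hosts printed here differ from those in the machinefile, the qstat
value is the one that reflects what TORQUE actually allocated; the
machinefile only controls where mpirun tries to start processes.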