[torqueusers] Newbie torque script questions

dave first linux4dave at gmail.com
Wed Dec 6 10:15:31 MST 2006


I am such a newbie that I squeak.  I hope this is the correct forum in which
to ask this question.

I want to use a node list other than the one Torque supplies in $PBS_NODEFILE.
I want to specify n10, n11, n12 and n13, each with 4 processors.  The node
list looks something like this:

n10:4
n11:4
n12:4
n13:4

It is called local_nodelist and lives in the working directory.

The script sets PBS_NODEFILE=`pwd`/local_nodelist
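
A minimal sketch of that override, assuming a bash job script. The file name and node names are the ones given above; the variable names TORQUE_NODEFILE and NPROCS and the echo lines are illustrative only, not a copy of the actual script:

```shell
#!/bin/bash
# Sketch only: TORQUE_NODEFILE and NPROCS are illustrative names.

# Write the hand-made node list into the working directory.
cat > local_nodelist <<'EOF'
n10:4
n11:4
n12:4
n13:4
EOF

# Remember the node file Torque assigned (empty when run outside a job).
TORQUE_NODEFILE="${PBS_NODEFILE:-}"

# Override the variable for this shell and the processes it starts.
PBS_NODEFILE="$(pwd)/local_nodelist"
export PBS_NODEFILE

echo "Torque node file was: ${TORQUE_NODEFILE:-<none>}"
echo "The node file is $PBS_NODEFILE"

# Count the non-empty lines, one per node entry.
NPROCS=$(grep -c . "$PBS_NODEFILE")
echo "This job has allocated $NPROCS nodes/processors."
```

The assignment only changes the environment of the script's own shell and of anything it launches. Keeping the original value in a separate variable makes it possible to print both files side by side.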

Running qstat -f while the script is executing shows what seems to be an
erroneous node list:

Job Id: 76.excalibur
    Job_Name = pbs_mpich.
    Job_Owner = joeb at excalibur.example.com
    resources_used.cput = 00:00:00
    resources_used.mem = 4296kb
    resources_used.vmem = 175988kb
    resources_used.walltime = 00:00:12
    job_state = R
    queue = default
    server = excalibur.example.com
    Checkpoint = u
    ctime = Wed Dec  6 08:54:16 2006
    Error_Path = excalibur.example.com:/home/joeb/pbs_mpich..e76
    exec_host = n04/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Dec  6 08:54:17 2006
    Output_Path = excalibur.example.com:/home/joeb/pbs_mpich..o76
    Priority = 0
    qtime = Wed Dec  6 08:54:16 2006
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    session_id = 31725
    Variable_List = PBS_O_HOME=/home/joeb,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=joeb,
        PBS_O_PATH=/opt/torque/bin:/opt/bin:/opt/hdfview/bin:/opt/hdf/bin:/opt
        /ncarg/bin:/opt/mpich/p4-gnu/bin:/opt/mpiexec//bin:/usr/kerberos/bin:/o
        pt/java/jdk1.5.0/bin:/usr/lib64/ccache/bin:/usr/local/bin:/bin:/usr/bin
        :/usr/X11R6/bin:/opt/java/jdk1.5.0/jre/bin:/opt/visit/bin:/home/joeb/bi
        n:/opt/mpich/p4-gnu/sbin,PBS_O_MAIL=/var/spool/mail/joeb,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=excalibur.example.com,
        PBS_O_WORKDIR=/home/joeb,PBS_O_QUEUE=default
    comment = Job started on Wed Dec 06 at 08:54
    etime = Wed Dec  6 08:54:16 2006
---------------------------------------------------------------------------------

However, the script output looks like this:

Job ID: 76.excalibur.example.com
Working directory is /home/joeb
Running on host n04.example.com
Time is Wed Dec 6 08:54:17 PST 2006
Directory is /home/joeb
The node file is /net/fs/home/joeb/local_nodefile
This job runs on the following processors:
n09.example.com:4 n10.example.com:4 n11.example.com:4 n12.example.com:4
This job has allocated 4 nodes/processors.

/usr/local/bin/mpich/x86_64/p4/gnu/bin/mpirun -nolocal -np 4 -machinefile /net/fs/home/joeb/local_nodefile /usr/local/bin/mpich/p4-gnu/examples/cpi

pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.003906
---------------------------------------------------------------------------------

Can anyone explain why the output of qstat -f and the script's echo statements
differ, and how I can determine which is correct?  (Short of having the job
sleep for a while so that I can look for all the processes?)

Thanks,
dave