[torqueusers] Wrong Exec Host

Regina Guilabert Canals regina.guilabert at uib.es
Tue Sep 25 01:31:17 MDT 2007


Dear TORQUE users,

TORQUE seems to ignore the number of nodes we request with -l nodes=3
(or any other number) and sets exec_host so that the job fills just one
node (both of its processors).

The particular problem is:

Script submitted via qsub un_proc.pbs:
#!/bin/tcsh
### Job name
#PBS -N TEST
### Declare job non-rerunable
##PBS -r n
### Max run time
#PBS -l walltime=00:20:30
### Mail to user
#PBS -m ae
### Queue name (small, medium, long, verylong)
#PBS -q batch
### Number of nodes (node property ev67 wanted)
#PBS -l nodes=3:ppn=2

# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

cat $PBS_NODEFILE

echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

### Run NON-PARALLEL PROGRAM

awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
###################### END of SCRIPT ######################
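
For completeness, this is how we summarize the slot allocation from
inside the job (plain shell; the sample file below stands in for the
real $PBS_NODEFILE, so the snippet also runs outside TORQUE):

```shell
# Count how many slots each node contributes to the job.
# $PBS_NODEFILE lists one line per allocated slot; here a sample
# file takes its place so this can be tested without a running job.
PBS_NODEFILE=$(mktemp)
printf 'cell8\ncell8\ncell9\ncell9\ncell10\ncell10\n' > "$PBS_NODEFILE"

# One line per node, prefixed by its slot count.
sort "$PBS_NODEFILE" | uniq -c

rm -f "$PBS_NODEFILE"
```

With nodes=3:ppn=2 we would expect three different hostnames, two
slots each.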


Report on qstat:
megacelula:~> qstat -n1

megacelula:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
3105.megacelula      dfsvhs9  batch    TEST         6027     3  --     -- 00:20 R 00:05   cell8/1+cell8/0

Extended report on this job shows the wrong exec_host assigned:
megacelula:~> qstat -f 3105
Job Id: 3105.megacelula
     Job_Name = TEST
     Job_Owner = dfsvhs9 at megacelula
     resources_used.cput = 00:05:52
     resources_used.mem = 3204kb
     resources_used.vmem = 13796kb
     resources_used.walltime = 00:05:53
     job_state = R
     queue = batch
     server = megacelula
     Checkpoint = u
     ctime = Mon Sep 24 09:48:26 2007
     Error_Path = megacelula:/megadisk/people/dfsvhs9/TEST.e3105
     exec_host = cell8/1+cell8/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = ae
     mtime = Mon Sep 24 09:48:27 2007
     Output_Path = megacelula:/megadisk/people/dfsvhs9/TEST.o3105
     Priority = 0
     qtime = Mon Sep 24 09:48:26 2007
     Rerunable = True
     Resource_List.nodect = 3
     Resource_List.nodes = 3:ppn=2
     Resource_List.walltime = 00:20:30
     session_id = 6027
     Variable_List = PBS_O_HOME=/megadisk/people/dfsvhs9,
         PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=dfsvhs9,
         PBS_O_PATH=/home/mcidas/bin:/usr/local/ncarg/bin:./:;:/usr/local/bin:
         /usr/bin:/bin:/usr/bin/X11:/usr/games:;:/megadisk/people/dfsvhs9/bin/bin
         :;:/usr/pgi/linux86/6.0/bin:;:/usr/local/mpich/bin,
         PBS_O_MAIL=/var/mail/dfsvhs9,PBS_O_SHELL=/bin/tcsh,
         PBS_O_HOST=megacelula,PBS_O_WORKDIR=/megadisk/people/dfsvhs9,
         PBS_O_QUEUE=batch
     comment = Job started on Mon Sep 24 at 09:48
     etime = Mon Sep 24 09:48:26 2007


Has anybody ever come across this weird behaviour?
We cannot use more than one node unless we request the nodes explicitly
with "-l nodes=[list of nodes]". Can anyone provide a hint on how to
fix or diagnose this problem further?
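
For what it's worth, the exec_host string can be checked for distinct
hosts with plain shell (no TORQUE needed; the string below is the one
from our job):

```shell
# exec_host has the form host/slot+host/slot+...
# Split on '+', keep the host part, and list the distinct hosts.
exec_host="cell8/1+cell8/0"
echo "$exec_host" | tr '+' '\n' | cut -d/ -f1 | sort -u
```

For nodes=3:ppn=2 this should print three hosts; here it prints only
cell8.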

Thanks in advance.

PS: In case this helps, see server and queues configuration:
** Server configuration:
megacelula:~> qstat -Bf
Server: megacelula
     server_state = Active
     scheduling = True
     total_jobs = 1
     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
     managers = torque at megacelula
     operators = torque at megacelula
     default_queue = batch
     log_events = 511
     mail_from = adm
     query_other_jobs = True
     resources_available.nodect = 28
     resources_default.nodes = 2:ppn2
     resources_assigned.nodect = 3
     scheduler_iteration = 100
     node_check_rate = 150
     tcp_timeout = 6
     node_pack = True
     pbs_version = 2.1.8

** Queue configuration:
megacelula:~> qstat -Qf
Queue: debug
     queue_type = Execution
     Priority = 5
     total_jobs = 0
     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
     max_running = 1
     resources_max.walltime = 00:10:00
     mtime = 1180373373
     resources_available.nodect = 2
     enabled = True
     started = True

Queue: batch
     queue_type = Execution
     total_jobs = 1
     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
     resources_max.walltime = 48:00:00
     resources_default.walltime = 01:00:00
     mtime = 1185525866
     resources_available.nodect = 56
     resources_assigned.nodect = 3
     enabled = True
     started = True


Regina Guilabert Canals
Grup de Meteorologia

Edif. Mateu Orfila					Tel: +34 971 17 3213
Universitat de les Illes Balears		Fax: +34 971 17 3426
07122 Palma de Mallorca (Spain) 	email: regina.guilabert at uib.es


