[torqueusers] Wrong Exec Host

scoggins jscoggins at lbl.gov
Tue Sep 25 10:33:36 MDT 2007


The question is: which scheduler are you using, and how is it
configured?
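(Editor's note for later readers: one way to answer that question is to look at which scheduler daemon is running. The sketch below scans a process listing for the usual candidates; the sample `ps` line is hypothetical, and on a live system you would feed it the real output of `ps -ef`.)

```python
import re

def find_schedulers(ps_output):
    """Return the scheduler daemons found in `ps -ef`-style output.

    pbs_sched is TORQUE's built-in FIFO scheduler; Maui and Moab are
    external schedulers with their own node-allocation policies.
    """
    return sorted(set(re.findall(r"pbs_sched|maui|moab", ps_output)))

# Hypothetical process listing for illustration:
sample_ps = "root  1234  1  0 Sep25 ?  00:00:01 /usr/local/sbin/pbs_sched"
print(find_schedulers(sample_ps))  # ['pbs_sched']
```

On a live system, `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout` (or simply `ps -ef | grep -E 'pbs_sched|maui|moab'` at the shell) supplies the real listing.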


On Sep 25, 2007, at 12:31 AM, Regina Guilabert Canals wrote:

> Dear TORQUE users,
>
> TORQUE seems to ignore the number of nodes we request with
> "-l nodes=3" (or any number) and sets exec_host to fill just one
> node (both processors).
>
> The particular problem is:
>
> Script submitted via qsub un_proc.pbs:
> #!/bin/tcsh
> ### Job name
> #PBS -N TEST
> ### Declare job non-rerunable
> ##PBS -r n
> ### Max run time
> #PBS -l walltime=00:20:30
> ### Mail to user
> #PBS -m ae
> ### Queue name (small, medium, long, verylong)
> #PBS -q batch
> ### Number of nodes (node property ev67 wanted)
> #PBS -l nodes=3:ppn=2
>
> # This job's working directory
> echo Working directory is $PBS_O_WORKDIR
> cd $PBS_O_WORKDIR
>
> cat $PBS_NODEFILE
>
> echo Running on host `hostname`
> echo Time is `date`
> echo Directory is `pwd`
>
> ### Run NON-PARALLEL PROGRAM
>
> awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
> ###################### END of SCRIPT ######################
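(Editor's note: the `cat $PBS_NODEFILE` line in the script above is the right diagnostic. Here is a small sketch of turning its output into a host/slot count inside the job, assuming the usual TORQUE format of one hostname per allocated slot:)

```python
from collections import Counter

def summarize_nodefile(text):
    """Count distinct hosts and total slots from $PBS_NODEFILE contents.

    TORQUE writes one hostname per allocated slot, so a correctly
    honoured nodes=3:ppn=2 request should yield 6 lines spread over
    3 distinct hosts.
    """
    hosts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return len(hosts), sum(hosts.values())

# With the allocation the poster actually got (one node, two slots):
nodes, slots = summarize_nodefile("cell8\ncell8\n")
print(nodes, slots)  # 1 2
```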
>
>
> Report on qstat:
> megacelula:~> qstat -n1
>
> megacelula:
>                                                                  Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 3105.megacelula      dfsvhs9  batch    TEST         6027      3  --     --  00:20 R 00:05   cell8/1+cell8/0
>
> Extended report on this job shows the wrong exec_host assigned:
> megacelula:~> qstat -f 3105
> Job Id: 3105.megacelula
>     Job_Name = TEST
>     Job_Owner = dfsvhs9 at megacelula
>     resources_used.cput = 00:05:52
>     resources_used.mem = 3204kb
>     resources_used.vmem = 13796kb
>     resources_used.walltime = 00:05:53
>     job_state = R
>     queue = batch
>     server = megacelula
>     Checkpoint = u
>     ctime = Mon Sep 24 09:48:26 2007
>     Error_Path = megacelula:/megadisk/people/dfsvhs9/TEST.e3105
>     exec_host = cell8/1+cell8/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = ae
>     mtime = Mon Sep 24 09:48:27 2007
>     Output_Path = megacelula:/megadisk/people/dfsvhs9/TEST.o3105
>     Priority = 0
>     qtime = Mon Sep 24 09:48:26 2007
>     Rerunable = True
>     Resource_List.nodect = 3
>     Resource_List.nodes = 3:ppn=2
>     Resource_List.walltime = 00:20:30
>     session_id = 6027
>     Variable_List = PBS_O_HOME=/megadisk/people/dfsvhs9,
>         PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=dfsvhs9,
>         PBS_O_PATH=/home/mcidas/bin:/usr/local/ncarg/bin:./:;:/usr/local/bin:
>         /usr/bin:/bin:/usr/bin/X11:/usr/games:;:/megadisk/people/dfsvhs9/bin/bin
>         :;:/usr/pgi/linux86/6.0/bin:;:/usr/local/mpich/bin,
>         PBS_O_MAIL=/var/mail/dfsvhs9,PBS_O_SHELL=/bin/tcsh,
>         PBS_O_HOST=megacelula,PBS_O_WORKDIR=/megadisk/people/dfsvhs9,
>         PBS_O_QUEUE=batch
>     comment = Job started on Mon Sep 24 at 09:48
>     etime = Mon Sep 24 09:48:26 2007
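(Editor's note: the `exec_host` string in the output above encodes host/cpu pairs joined by `+`. A quick sketch for decoding it, handy when checking what the scheduler actually assigned:)

```python
def parse_exec_host(spec):
    """Split a TORQUE exec_host string such as 'cell8/1+cell8/0'
    into (host, cpu_index) pairs."""
    pairs = []
    for chunk in spec.split("+"):
        host, _, cpu = chunk.partition("/")
        pairs.append((host, int(cpu)))
    return pairs

print(parse_exec_host("cell8/1+cell8/0"))
# [('cell8', 1), ('cell8', 0)] -- both slots landed on one host, not three
```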
>
>
> Has anybody ever seen this weird behaviour?
> We cannot use more than one node unless we explicitly request it
> with "-l nodes=[list of nodes]". Can anyone provide a hint on how
> to fix or diagnose this problem further?
>
> Thanks in advance.
>
> PS: In case this helps, see server and queues configuration:
> ** Server configuration:
> megacelula:~> qstat -Bf
> Server: megacelula
>     server_state = Active
>     scheduling = True
>     total_jobs = 1
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
>     managers = torque at megacelula
>     operators = torque at megacelula
>     default_queue = batch
>     log_events = 511
>     mail_from = adm
>     query_other_jobs = True
>     resources_available.nodect = 28
>     resources_default.nodes = 2:ppn2
>     resources_assigned.nodect = 3
>     scheduler_iteration = 100
>     node_check_rate = 150
>     tcp_timeout = 6
>     node_pack = True
>     pbs_version = 2.1.8
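(Editor's note: two settings in the dump above are worth a second look. `node_pack = True` reportedly tells pbs_server to pack jobs onto as few nodes as possible, and `resources_default.nodes = 2:ppn2` looks malformed (presumably `2:ppn=2` was intended). A hedged sketch of how one might adjust them with qmgr; these are suggestions to test, not a confirmed fix, so verify against your site's policy first:)

```shell
# Hypothetical qmgr adjustments -- review before running on a live server.
qmgr -c 'unset server node_pack'                  # stop packing jobs onto one node
qmgr -c 'set server resources_default.nodes = 1'  # replace the malformed default
```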
>
> ** Queue configuration:
> megacelula:~> qstat -Qf
> Queue: debug
>     queue_type = Execution
>     Priority = 5
>     total_jobs = 0
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>     max_running = 1
>     resources_max.walltime = 00:10:00
>     mtime = 1180373373
>     resources_available.nodect = 2
>     enabled = True
>     started = True
>
> Queue: batch
>     queue_type = Execution
>     total_jobs = 1
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
>     resources_max.walltime = 48:00:00
>     resources_default.walltime = 01:00:00
>     mtime = 1185525866
>     resources_available.nodect = 56
>     resources_assigned.nodect = 3
>     enabled = True
>     started = True
>
>
> Regina Guilabert Canals
> Grup de Meteorologia
>
> Edif. Mateu Orfila					Tel: +34 971 17 3213
> Universitat de les Illes Balears		Fax: +34 971 17 3426
> 07122 Palma de Mallorca (Spain) 	email: regina.guilabert at uib.es
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
