[torqueusers] Wrong Exec Host
scoggins
jscoggins at lbl.gov
Tue Sep 25 10:33:36 MDT 2007
Which scheduler are you using, and how is it configured?
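
One rough way to answer the first half is to look for the usual scheduler daemon names on the server host. This is only a sketch: it assumes the scheduler is one of pbs_sched (the FIFO scheduler bundled with TORQUE), maui, or moab, and detect_scheduler is a hypothetical helper name, not a TORQUE command. Any other scheduler would not be found by it.

```sh
# Hypothetical helper: reads process names (one per line) on stdin and
# prints each name that matches a known scheduler daemon, once.
detect_scheduler() {
    grep -E '^(pbs_sched|maui|moab)$' | sort -u
}

# On the server host (assumes a ps that supports -e -o comm=):
#   ps -e -o comm= | detect_scheduler
```

If nothing is printed, either no scheduler is running or it runs under a name the pattern does not cover.
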
On Sep 25, 2007, at 12:31 AM, Regina Guilabert Canals wrote:
> Dear TORQUE users,
>
> TORQUE seems to ignore the number of nodes we request with -l
> nodes=3 (or any other number) and sets exec_host to fill just one
> node (both processors).
>
> The particular problem is:
>
> Script submitted via qsub un_proc.pbs:
> #!/bin/tcsh
> ### Job name
> #PBS -N TEST
> ### Declare job non-rerunable
> ##PBS -r n
> ### Max run time
> #PBS -l walltime=00:20:30
> ### Mail to user
> #PBS -m ae
> ### Queue name (small, medium, long, verylong)
> #PBS -q batch
> ### Number of nodes (node property ev67 wanted)
> #PBS -l nodes=3:ppn=2
>
> # This job's working directory
> echo Working directory is $PBS_O_WORKDIR
> cd $PBS_O_WORKDIR
>
> cat $PBS_NODEFILE
>
> echo Running on host `hostname`
> echo Time is `date`
> echo Directory is `pwd`
>
> ### Run NON-PARALLEL PROGRAM
>
> awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
> ###################### END of SCRIPT ######################
>
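
The script above prints $PBS_NODEFILE with cat. Since that file holds one line per allocated processor slot, a per-host tally makes the allocation easier to read. A minimal sketch, where slots_per_host is a hypothetical helper name and the only assumption is one hostname per line in the file:

```sh
# Hypothetical helper: $1 is the path to a node file with one hostname
# per line; prints "host: slot_count" for each distinct host.
slots_per_host() {
    sort "$1" | uniq -c | awk '{print $2 ": " $1}'
}

# Inside a job script (sh/bash):
#   slots_per_host "$PBS_NODEFILE"
```

For a nodes=3:ppn=2 request that was honoured, this would list three hosts with two slots each.
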
>
> Report on qstat:
> megacelula:~> qstat -n1
>
> megacelula:
>
>                                                                    Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 3105.megacelula      dfsvhs9  batch    TEST         6027     3  --     -- 00:20 R 00:05 cell8/1+cell8/0
>
> Extended report on this job shows the wrong exec_host assigned:
> megacelula:~> qstat -f 3105
> Job Id: 3105.megacelula
> Job_Name = TEST
> Job_Owner = dfsvhs9 at megacelula
> resources_used.cput = 00:05:52
> resources_used.mem = 3204kb
> resources_used.vmem = 13796kb
> resources_used.walltime = 00:05:53
> job_state = R
> queue = batch
> server = megacelula
> Checkpoint = u
> ctime = Mon Sep 24 09:48:26 2007
> Error_Path = megacelula:/megadisk/people/dfsvhs9/TEST.e3105
> exec_host = cell8/1+cell8/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = ae
> mtime = Mon Sep 24 09:48:27 2007
> Output_Path = megacelula:/megadisk/people/dfsvhs9/TEST.o3105
> Priority = 0
> qtime = Mon Sep 24 09:48:26 2007
> Rerunable = True
> Resource_List.nodect = 3
> Resource_List.nodes = 3:ppn=2
> Resource_List.walltime = 00:20:30
> session_id = 6027
> Variable_List = PBS_O_HOME=/megadisk/people/dfsvhs9,
> PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=dfsvhs9,
> PBS_O_PATH=/home/mcidas/bin:/usr/local/ncarg/bin:./:;:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:;:/megadisk/people/dfsvhs9/bin/bin:;:/usr/pgi/linux86/6.0/bin:;:/usr/local/mpich/bin,
> PBS_O_MAIL=/var/mail/dfsvhs9,PBS_O_SHELL=/bin/tcsh,
> PBS_O_HOST=megacelula,PBS_O_WORKDIR=/megadisk/people/dfsvhs9,
> PBS_O_QUEUE=batch
> comment = Job started on Mon Sep 24 at 09:48
> etime = Mon Sep 24 09:48:26 2007
>
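
The qstat -f output above reports Resource_List.nodect = 3 while exec_host names only cell8. A minimal sketch for checking that mismatch on other jobs, assuming exec_host always has the host/slot+host/slot form shown here. count_exec_hosts is a hypothetical helper name, not a TORQUE command:

```sh
# Hypothetical helper: $1 is an exec_host string such as
# "cell8/1+cell8/0"; prints the number of distinct hosts in it.
count_exec_hosts() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*||' | sort -u | grep -c .
}

# Example with the value from job 3105:
count_exec_hosts 'cell8/1+cell8/0'
```

A result lower than Resource_List.nodect means the job was given fewer distinct hosts than it requested.
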
>
> Has anybody ever seen this weird behaviour?
> We cannot use more than one node unless it is explicitly requested
> with "-l nodes=[list of nodes]". Can anyone provide a hint on how to
> fix or diagnose this problem further?
>
> Thanks in advance.
>
> PS: In case it helps, here is the server and queue configuration:
> ** Server configuration:
> megacelula:~> qstat -Bf
> Server: megacelula
> server_state = Active
> scheduling = True
> total_jobs = 1
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
> managers = torque at megacelula
> operators = torque at megacelula
> default_queue = batch
> log_events = 511
> mail_from = adm
> query_other_jobs = True
> resources_available.nodect = 28
> resources_default.nodes = 2:ppn2
> resources_assigned.nodect = 3
> scheduler_iteration = 100
> node_check_rate = 150
> tcp_timeout = 6
> node_pack = True
> pbs_version = 2.1.8
>
> ** Queue configuration:
> megacelula:~> qstat -Qf
> Queue: debug
> queue_type = Execution
> Priority = 5
> total_jobs = 0
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> max_running = 1
> resources_max.walltime = 00:10:00
> mtime = 1180373373
> resources_available.nodect = 2
> enabled = True
> started = True
>
> Queue: batch
> queue_type = Execution
> total_jobs = 1
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
> resources_max.walltime = 48:00:00
> resources_default.walltime = 01:00:00
> mtime = 1185525866
> resources_available.nodect = 56
> resources_assigned.nodect = 3
> enabled = True
> started = True
>
>
> Regina Guilabert Canals
> Grup de Meteorologia
>
> Edif. Mateu Orfila Tel: +34 971 17 3213
> Universitat de les Illes Balears Fax: +34 971 17 3426
> 07122 Palma de Mallorca (Spain) email: regina.guilabert at uib.es
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers