[torqueusers] pbsnodes reports the same job running many times

Leonardo Gregory Brunnet leon at if.ufrgs.br
Thu Apr 19 14:07:34 MDT 2012


David,

Thanks for the reply.

I have checked the qsub script. It seems to request a single CPU
(#PBS -l ncpus=1), but maybe the correct argument should be
#PBS -l nodes=1.

qsub script
**************************************
#PBS -j oe
#PBS -l ncpus=1
#PBS -q um_mes
#PBS -N dif_t170_u106_6

work_dir=$PBS_O_WORKDIR
cd $work_dir
./script_6
**************************************

But the output of qstat -f (below) seems to indicate that 4 CPUs
have been requested...
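
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: the script requests ncpus=1 but no nodes, so the queue's
resources_default.nodes value may be applied to the job. The sketch below
lists the per-queue default from a few lines copied out of the
qmgr -c "p s" output quoted further down in this message (the function
name default_nodes is only illustrative):

```shell
#!/bin/sh
# Sample lines copied from the qmgr -c "p s" output quoted later in this message.
qmgr_output='set queue um_mes resources_max.nodes = 7
set queue um_mes resources_default.nodes = 7
set queue um_mes resources_default.walltime = 720:00:00
set queue uma_semana resources_default.nodes = 7
set queue uma_semana resources_default.walltime = 168:00:00
set queue batch resources_default.nodes = 1'

# Print "queue default_nodes" for every queue that sets resources_default.nodes.
default_nodes() {
    printf '%s\n' "$1" |
        awk '$1 == "set" && $2 == "queue" && $4 == "resources_default.nodes" {print $3, $6}'
}

default_nodes "$qmgr_output"
```

On the server itself the same awk filter could read from qmgr -c "p s"
directly. If the default is indeed what is being picked up, requesting
nodes explicitly in the job script (for example #PBS -l nodes=1, as
suggested above) would be the thing to try.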

Leonardo
*************************************
Job Id: 78898.master
     Job_Name = dif_u91_t170_1
     Job_Owner = andressa at master.cluster
     resources_used.cput = 5124169:16:44
     resources_used.mem = 6204kb
     resources_used.vmem = 52424kb
     resources_used.walltime = 76:15:25
     job_state = R
     queue = uma_semana
     server = master.cluster
     Checkpoint = u
     ctime = Mon Apr 16 12:35:53 2012
     Error_Path = master.cluster:/home103/andressa/ramp_system/dif_u91_t170_1.e78898
     exec_host = node131/3+node131/2+node131/1+node131/0+node123/2+node123/1+node123/0
     Hold_Types = n
     Join_Path = oe
     Keep_Files = n
     Mail_Points = a
     mtime = Mon Apr 16 12:36:40 2012
     Output_Path = master.cluster:/home103/andressa/ramp_system/dif_u91_t170_1.o78898
     Priority = 0
     qtime = Mon Apr 16 12:35:53 2012
     Rerunable = True
     Resource_List.ncpus = 1
     Resource_List.neednodes = 7
     Resource_List.nodect = 7
     Resource_List.nodes = 7
     Resource_List.walltime = 168:00:00
     session_id = 8224
     substate = 42
     Variable_List = PBS_O_QUEUE=uma_semana,
     PBS_O_HOST=master.cluster,PBS_O_HOME=/home103/andressa,
     PBS_O_LANG=pt_BR.UTF-8,PBS_O_LOGNAME=andressa,
     PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/,
     PBS_O_MAIL=/var/mail/andressa,PBS_O_SHELL=/bin/bash,
     PBS_SERVER=master.cluster,
     PBS_O_WORKDIR=/home103/andressa/ramp_system
     euser = andressa
     egroup = 1030
     hashname = 78898.master.cluster
     queue_rank = 24
     queue_type = E
     etime = Mon Apr 16 12:35:53 2012
     submit_args = ./script_minuano_1
     start_time = Mon Apr 16 12:35:54 2012
     Walltime.Remaining = 330258
     start_count = 1
     fault_tolerant = False
     submit_host = master.cluster
     init_work_dir = /home103/andressa/ramp_system
****************************************
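
The exec_host value above appears to list one entry per allocated
execution slot, so counting the entries per host shows how this job comes
to occupy all four slots of node131. A minimal sketch, assuming the
wrapped exec_host lines are joined into a single string (the function name
slots_per_host is only illustrative):

```shell
#!/bin/sh
# exec_host value copied from the qstat -f output above, wrapped lines joined.
exec_host='node131/3+node131/2+node131/1+node131/0+node123/2+node123/1+node123/0'

# Split on '+', drop the '/slot' suffix, then count entries per host.
slots_per_host() {
    printf '%s\n' "$1" | tr '+' '\n' | cut -d/ -f1 | sort | uniq -c |
        awk '{print $2, $1}'
}

slots_per_host "$exec_host"
```

This prints "node123 3" and "node131 4", seven slots in total, which
matches Resource_List.nodes = 7 rather than Resource_List.ncpus = 1.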

On 19-04-2012 12:32, David Beer wrote:
> What is the qstat -f output for this job? pbs_server reports dedicated 
> resources in pbsnodes output, so if the job requests 4 execution slots 
> and runs one process, pbs_server will report 4 execution slots as 
> occupied. Conversely, if a job asks for one execution slot and uses 4, 
> pbs_server will not scale up what it reports.
>
> David
>
> On Wed, Apr 18, 2012 at 4:26 PM, Leonardo Gregory Brunnet 
> <leon at if.ufrgs.br> wrote:
>
>     Dear All,
>
>     In a freshly installed torque/maui cluster, the server reports
>     repeated execution of a job on a given node. (No job is running
>     MPI!)
>
>     The output for pbsnodes for one given node gives:
>
>     node131
>          state = job-exclusive
>          np = 4
>          properties = quadcore
>          ntype = cluster
>          jobs = 0/78898.master.cluster.XX.XX.XX, 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, 3/78898.master.XX.XX.XX
>          status = rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC 2007 x86_64,opsys=linux
>          gpus = 0
>
>     But if we log in to that node, we see what was expected: a single
>     job. Since the torque server (or maui) "believes" all CPUs of that
>     node are busy, no other jobs are sent. Any clues?
>
>     Thanks for the help!
>
>     Leonardo
>
>     Below, you find the output for
>     # qmgr -c "p s"
>
>     #
>     # Create queues and set their attributes.
>     #
>     #
>     # Create and define queue padrao
>     #
>     create queue padrao
>     set queue padrao queue_type = Execution
>     set queue padrao resources_default.nodes = 7
>     set queue padrao resources_default.walltime = 01:00:00
>     set queue padrao max_user_run = 5
>     set queue padrao enabled = True
>     set queue padrao started = True
>     #
>     # Create and define queue um_mes
>     #
>     create queue um_mes
>     set queue um_mes queue_type = Execution
>     set queue um_mes resources_max.nodes = 7
>     set queue um_mes resources_default.nodes = 7
>     set queue um_mes resources_default.walltime = 720:00:00
>     set queue um_mes max_user_run = 5
>     set queue um_mes enabled = True
>     set queue um_mes started = True
>     #
>     # Create and define queue batch
>     #
>     create queue batch
>     set queue batch queue_type = Execution
>     set queue batch resources_default.nodes = 1
>     set queue batch resources_default.walltime = 01:00:00
>     set queue batch enabled = True
>     set queue batch started = True
>     #
>     # Create and define queue um_dia
>     #
>     create queue um_dia
>     set queue um_dia queue_type = Execution
>     set queue um_dia resources_max.nodes = 7
>     set queue um_dia resources_default.nodes = 7
>     set queue um_dia resources_default.walltime = 24:00:00
>     set queue um_dia max_user_run = 7
>     set queue um_dia enabled = True
>     set queue um_dia started = True
>     #
>     # Create and define queue uma_semana
>     #
>     create queue uma_semana
>     set queue uma_semana queue_type = Execution
>     set queue uma_semana resources_max.nodes = 7
>     set queue uma_semana resources_default.nodes = 7
>     set queue uma_semana resources_default.walltime = 168:00:00
>     set queue uma_semana max_user_run = 5
>     set queue uma_semana enabled = True
>     set queue uma_semana started = True
>     #
>     # Create and define queue route
>     #
>     create queue route
>     set queue route queue_type = Route
>     set queue route route_destinations = padrao
>     set queue route route_destinations += padrao2
>     set queue route enabled = True
>     set queue route started = True
>     #
>     # Set server attributes.
>     #
>     set server scheduling = True
>     set server acl_hosts = master.cluster.XX.XX.XX
>     set server acl_hosts += clusterapg
>     set server managers = root at master.cluster.XX.XX.XX
>     set server operators = root at master.cluster.XX.XX.XX
>     set server default_queue = padrao
>     set server log_events = 511
>     set server mail_from = adm
>     set server scheduler_iteration = 600
>     set server node_check_rate = 150
>     set server tcp_timeout = 6
>     set server mom_job_sync = True
>     set server keep_completed = 300
>     set server next_job_number = 79033
>
>     --
>     Leonardo Gregory Brunnet                  E-mail: leon at if.ufrgs.br
>     Instituto de Fisica - UFRGS               http://pcleon.if.ufrgs.br
>     91501-970 Porto Alegre, RS, BRASIL        Phone: (51) 33 08 72 51
>     FAX +55 51 33 08 72 86                     C.P. 15051
>     Linux User: 39314
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> -- 
> David Beer | Software Engineer
> Adaptive Computing
>

-- 
Leonardo Gregory Brunnet                  E-mail: leon at if.ufrgs.br
Instituto de Fisica - UFRGS               http://pcleon.if.ufrgs.br
91501-970 Porto Alegre, RS, BRASIL        Phone: (51) 33 08 72 51
FAX +55 51 33 08 72 86                     C.P. 15051
Linux User: 39314
