[torqueusers] Job with high proc count will not schedule

Ken Nielson knielson at adaptivecomputing.com
Thu Mar 4 13:52:12 MST 2010


Jonathan,

You only need one entry for each node in your cluster, but you will want 
to designate the number of processes allowed to run on each node. Your 
server_priv/nodes file will have an entry like the following for each 
host in your cluster that is running a pbs_mom.

node_name np=4

node_name is just the host name of the node where the MOM is running, and 
np=4 tells TORQUE that a maximum of four processes can run on this node. 
I think the convention is that np should be equal to the number of cores 
on the node, but it really is the number of processes allowed to run on 
that particular node. So even if I only have four cores, I can still set 
np to 8 or 10 or even 100. TORQUE will schedule as many jobs as there are 
processes available.
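
As a sketch only (the host names here are made up), a nodes file for a 
mix of 8-core and 4-core machines could look like this:

node01 np=8
node02 np=8
node03 np=4

After editing server_priv/nodes, restart pbs_server so it rereads the 
file; pbsnodes -a should then report the np value you set for each host.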

Ken Nielson
Adaptive Computing


Jonathan K Shelley wrote:
> When I did what you recommended
>
> qsub -I -l procs=48
>
> my node file has only one entry in it
>
> eos { ~ }$ cat $PBS_NODEFILE
> eos
>
> I need a node file with one entry for each processor. I also want to be 
> able to specify chunks of resources (i.e. nodes=6:ppn=4), since I have 
> some 4- and 8-core machines and I don't want to get fewer than four 
> procs on a machine.
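>
> As a sketch of what I am after (the host names are invented), a request 
> like
>
> qsub -I -l nodes=2:ppn=4
>
> should land on two hosts and give a $PBS_NODEFILE with one line per 
> processor, for example:
>
> node01
> node01
> node01
> node01
> node02
> node02
> node02
> node02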
>
> Reading the admin documentation from section 10.1.7, quoted below, 
> suggests that I should set resources_available.nodect:
> qsub will not allow the submission of jobs requesting many processors
> TORQUE's definition of a node is context sensitive and can appear 
> inconsistent. The qsub '-l nodes=<X>' expression can at times indicate 
> a request for X processors and at other times be interpreted as a 
> request for X nodes. While qsub allows multiple interpretations of the 
> keyword nodes, aspects of the TORQUE server's logic are not so 
> flexible. Consequently, if a job is using '-l nodes' to specify 
> processor count and the requested number of processors exceeds the 
> available number of physical nodes, the server daemon will reject the 
> job.
> To get around this issue, the server can be told it has an inflated 
> number of nodes using the resources_available attribute. To take 
> effect, this attribute should be set on both the server and the 
> associated queue as in the example below. See resources_available 
> for more information.
>
> > qmgr
> Qmgr: set server resources_available.nodect=2048
> Qmgr: set queue batch resources_available.nodect=2048
>
> NOTE: The pbs_server daemon will need to be restarted before these 
> changes will take effect.
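>
> (As an aside, a minimal way to restart pbs_server, assuming a plain 
> source install with the TORQUE binaries on root's PATH, is:
>
> qterm -t quick
> pbs_server
>
> Packaged installs usually ship an init script that wraps the same 
> steps, e.g. service pbs_server restart.)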
>
> Any Ideas?
>
> Thanks,
>
> Jon Shelley
> HPC Software Consultant
> Idaho National Lab
> Phone (208) 526-9834
> Fax (208) 526-0122
>
>
>
> From:    Roman Baranowski <roman at chem.ubc.ca>
> Sent by: torqueusers-bounces at supercluster.org
> Date:    03/02/2010 06:49 PM
> To:      Jonathan K Shelley <Jonathan.Shelley at inl.gov>
> cc:      torqueusers <torqueusers at supercluster.org>
> Subject: Re: [torqueusers] Job with high proc count will not schedule
>
> Dear Jonathan,
>
> You have only 5 nodes, so bumping up resources_available.nodect with 
> qmgr will never work. Have you tried
>
> qsub -I -l procs=112
>
> All the best,
> Roman
>
>
> On Tue, 2 Mar 2010, Jonathan K Shelley wrote:
>
> > I have a 5 node cluster with 112 cores. I just installed TORQUE 2.4.6.
> > It seems to be working, but when I submit the following:
> >
> > qsub -I -l nodes=32
> > qsub: waiting for job 551.eos.inel.gov to start
> >
> > I try a qrun and I get the following:
> >
> > eos:/opt/torque/sbin # qrun 551
> > qrun: Resource temporarily unavailable MSG=job allocation request
> > exceeds currently available cluster nodes, 32 requested, 5 available
> > 551.eos.inel.gov
> >
> > but it never schedules. I saw in the documentation that I needed to
> > set resources_available.nodect to a high number, so I did.
> >
> > When I run printserverdb I get:
> >
> > eos:/opt/torque/sbin # printserverdb
> > ---------------------------------------------------
> > numjobs:                0
> > numque:         1
> > jobidnumber:            552
> > sametm:         1267574146
> > --attributes--
> > total_jobs = 1
> > state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
> > default_queue = all
> > log_events = 511
> > mail_from = adm
> > query_other_jobs = True
> > resources_available.nodect = 2048
> > scheduler_iteration = 600
> > node_check_rate = 150
> > tcp_timeout = 6
> > pbs_version = 2.4.6
> > next_job_number = 551
> > net_counter = 3 0 0
> >
> > eos:/opt/torque/sbin # qmgr -c "p s"
> > #
> > # Create queues and set their attributes.
> > #
> > #
> > # Create and define queue all
> > #
> > create queue all
> > set queue all queue_type = Execution
> > set queue all resources_max.walltime = 672:00:00
> > set queue all resources_available.nodect = 2048
> > set queue all enabled = True
> > set queue all started = True
> > #
> > # Set server attributes.
> > #
> > set server acl_hosts = eos
> > set server managers = awm at eos.inel.gov
> > set server managers += lucads2 at eos.inel.gov
> > set server managers += poolrl at eos.inel.gov
> > set server managers += ''@eos.inel.gov
> > set server default_queue = all
> > set server log_events = 511
> > set server mail_from = adm
> > set server query_other_jobs = True
> > set server resources_available.nodect = 2048
> > set server scheduler_iteration = 600
> > set server node_check_rate = 150
> > set server tcp_timeout = 6
> > set server next_job_number = 552
> >
> > Any ideas what I need to do to get this working?
> >
> > Thanks,
> >
> > Jon Shelley
> > HPC Software Consultant
> > Idaho National Lab
> > Phone (208) 526-9834
> > Fax (208) 526-0122
> >
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   


