[torqueusers] Job with high proc count will not schedule

Jonathan K Shelley Jonathan.Shelley at inl.gov
Wed Mar 3 09:16:06 MST 2010


When I did what you recommended,

qsub -I -l procs=48

my node file had only one entry in it:

eos { ~ }$ cat $PBS_NODEFILE
eos
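(What I expect is one entry per core. For example, for procs=8 spread 
across two 4-core nodes the file would look roughly like this; the 
hostnames below are just placeholders:)

node01
node01
node01
node01
node02
node02
node02
node02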

I need a node file with one entry for each processor. I also want to be 
able to specify chunks of resources (i.e., nodes=6:ppn=4), since I have 
some 4- and 8-core machines and I don't want to get fewer than four procs 
on a machine. For example:
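qsub -I -l nodes=6:ppn=4

(That is the kind of chunked request I mean: six groups of four 
processors, so that each group lands entirely on one machine.)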

Reading section 10.1.7 of the admin documentation (quoted below) suggests 
that setting resources_available.nodect should work around this:
qsub will not allow the submission of jobs requesting many processors

TORQUE's definition of a node is context sensitive and can appear 
inconsistent. The qsub '-l nodes=<X>' expression can at times indicate a 
request for X processors and at other times be interpreted as a request 
for X nodes. While qsub allows multiple interpretations of the keyword 
nodes, aspects of the TORQUE server's logic are not so flexible. 
Consequently, if a job is using '-l nodes' to specify processor count and 
the requested number of processors exceeds the available number of 
physical nodes, the server daemon will reject the job.

To get around this issue, the server can be told it has an inflated number 
of nodes using the resources_available attribute. To take effect, this 
attribute should be set on both the server and the associated queue as in 
the example below. See resources_available for more information.

> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048

NOTE: The pbs_server daemon will need to be restarted before these changes 
will take effect.
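(For reference, restarting would be something along the lines of the 
following, assuming pbs_server is started directly rather than through a 
distro init script:)

qterm -t quick
pbs_server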

Any ideas?

Thanks,

Jon Shelley
HPC Software Consultant
Idaho National Lab
Phone (208) 526-9834
Fax (208) 526-0122

Roman Baranowski <roman at chem.ubc.ca>
Sent by: torqueusers-bounces at supercluster.org
03/02/2010 06:49 PM

To: Jonathan K Shelley <Jonathan.Shelley at inl.gov>
Cc: torqueusers <torqueusers at supercluster.org>
Subject: Re: [torqueusers] Job with high proc count will not schedule

                 Dear Jonathan,

You have only 5 nodes, so bumping up resources_available.nodect with qmgr 
will never work. Have you tried

                 qsub -I -l procs=112

                 All the best
                 Roman


On Tue, 2 Mar 2010, Jonathan K Shelley wrote:

> I have a 5 node cluster with 112 cores. I just installed torque 2.4.6. It 
> seems to be working, but when I submit the following:
> 
> qsub -I -l nodes=32
> qsub: waiting for job 551.eos.inel.gov to start
> 
> I try a qrun and I get the following:
> 
> eos:/opt/torque/sbin # qrun 551
> qrun: Resource temporarily unavailable MSG=job allocation request 
exceeds currently available cluster
> nodes, 32 requested, 5 available 551.eos.inel.gov
> 
> but it never schedules. I saw in the documentation that I needed to set 
> resources_available.nodect to a high number, so I did.
> 
> when I run printserverdb I get:
> 
> eos:/opt/torque/sbin # printserverdb
> ---------------------------------------------------
> numjobs:                0
> numque:         1
> jobidnumber:            552
> sametm:         1267574146
> --attributes--
> total_jobs = 1
> state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
> default_queue = all
> log_events = 511
> mail_from = adm
> query_other_jobs = True
> resources_available.nodect = 2048
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> pbs_version = 2.4.6
> next_job_number = 551
> net_counter = 3 0 0
> 
> eos:/opt/torque/sbin # qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue all
> #
> create queue all
> set queue all queue_type = Execution
> set queue all resources_max.walltime = 672:00:00
> set queue all resources_available.nodect = 2048
> set queue all enabled = True
> set queue all started = True
> #
> # Set server attributes.
> #
> set server acl_hosts = eos
> set server managers = awm at eos.inel.gov
> set server managers += lucads2 at eos.inel.gov
> set server managers += poolrl at eos.inel.gov
> set server managers += ''@eos.inel.gov
> set server default_queue = all
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_available.nodect = 2048
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 552
> 
> Any ideas what I need to do to get this working?
> 
> Thanks,
> 
> Jon Shelley
> HPC Software Consultant
> Idaho National Lab
> Phone (208) 526-9834
> Fax (208) 526-0122
> 
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
