[torqueusers] Job with high proc count will not schedule

Jonathan K Shelley Jonathan.Shelley at inl.gov
Tue Mar 2 17:32:30 MST 2010


I have a 5-node cluster with 112 cores. I just installed Torque 2.4.6. It 
seems to be working, but when I submit the following job it just waits:

qsub -I -l nodes=32
qsub: waiting for job 551.eos.inel.gov to start

I try a qrun and I get the following:

eos:/opt/torque/sbin # qrun 551
qrun: Resource temporarily unavailable MSG=job allocation request exceeds 
currently available cluster nodes, 32 requested, 5 available 
551.eos.inel.gov

but it never schedules. I saw in the documentation that I needed to set 
resources_available.nodect to a high number, so I did.
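
For reference, a setting like this is normally applied through qmgr. The 
lines below are a minimal sketch, not a record of the exact commands that 
were typed: they assume the value 2048 and the queue name "all" that appear 
in the qmgr output further down, and they need to be run as a Torque manager.

```shell
# Sketch only: value (2048) and queue name (all) are taken from the
# "qmgr -c 'p s'" output in this message.

# Raise the node count the server believes it can hand out.
qmgr -c "set server resources_available.nodect = 2048"

# Apply the same limit to the execution queue.
qmgr -c "set queue all resources_available.nodect = 2048"

# Print the server and queue configuration to confirm both settings.
qmgr -c "print server"
```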

When I run printserverdb I get:

eos:/opt/torque/sbin # printserverdb
---------------------------------------------------
numjobs:                0
numque:         1
jobidnumber:            552
sametm:         1267574146
--attributes--
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
default_queue = all
log_events = 511
mail_from = adm
query_other_jobs = True
resources_available.nodect = 2048
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
pbs_version = 2.4.6
next_job_number = 551
net_counter = 3 0 0

eos:/opt/torque/sbin # qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue all
#
create queue all
set queue all queue_type = Execution
set queue all resources_max.walltime = 672:00:00
set queue all resources_available.nodect = 2048
set queue all enabled = True
set queue all started = True
#
# Set server attributes.
#
set server acl_hosts = eos
set server managers = awm at eos.inel.gov
set server managers += lucads2 at eos.inel.gov
set server managers += poolrl at eos.inel.gov
set server managers += sheljk at eos.inel.gov
set server default_queue = all
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.nodect = 2048
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 552

Any ideas what I need to do to get this working?

Thanks,

Jon Shelley
HPC Software Consultant
Idaho National Lab
Phone (208) 526-9834
Fax (208) 526-0122