[torqueusers] submitted jobs not running all nodes

Chris Bright cbright at sci.utah.edu
Mon Jun 3 09:39:53 MDT 2013


Hello,

I'm sorry if I'm sending this to the wrong list. If so, please forgive me 
and direct me to the appropriate place.

I'm running a couple of clusters, one with 64 nodes and one with 4 nodes, 
using the default torque package for scheduling and everything else. When 
I submit a job that should use more than one node, it does not use all of 
the nodes; it stays on one node. When I run tracejob <job-id> or 
qstat -f <job-id>, the output shows that the nodes have been allocated to 
the job and everything appears to be fine. But if I go to the nodes 
individually and run top or ps -ef, the job appears on only one node and 
uses only the processors of that node.
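
A small test job along the following lines might help narrow this down. It 
is only a sketch, assuming the TORQUE-provided pbsdsh command and the 
PBS_NODEFILE environment variable are available inside the job; the job 
name, the node count, and the unique_hosts helper are illustrative and not 
taken from the actual setup. The background is that TORQUE starts the job 
script itself only on the first allocated node, so a script reaches the 
other nodes only if it launches work there (for example through pbsdsh or 
an MPI launcher).

```shell
#!/bin/sh
#PBS -N nodecheck
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:05:00

# Print each distinct hostname found in a node file.
# (Hypothetical helper; the node file lists one hostname per allocated slot.)
unique_hosts() {
    sort -u "$1"
}

# Run the checks only inside a TORQUE job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Hosts allocated to this job:"
    unique_hosts "$PBS_NODEFILE"

    # pbsdsh -u runs the command once on each unique allocated node.
    echo "Hosts reached through pbsdsh:"
    pbsdsh -u /bin/hostname
fi
```

Submitted with qsub, the job's output file would then list the allocated 
hosts next to the hosts on which a task was actually started, which can be 
compared against what top or ps -ef shows on the individual nodes.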

Does anyone have any idea what may be causing this behavior?

Here is a view of my qmgr configuration:
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts =  <this is a valid hostname>
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server next_job_number = 7334

Thanks,
Chris Bright



