[torqueusers] submitted jobs not running on all nodes

Chris Bright cbright at sci.utah.edu
Tue Jun 4 08:29:08 MDT 2013


I'm running a couple of clusters, one with 64 nodes and one with 5 nodes, 
using the default torque package for scheduling and everything else. 
When I submit a job that should use more than one node, it does not use 
all of the nodes but stays on one node. When I run tracejob <job-id> or 
qstat -f <job-id>, the output shows that the nodes have been allocated to 
the job and everything appears to be fine. If I go to the nodes 
individually, they have the appropriate job files in the mom directory, 
but if I run top or ps -ef, the job appears on only one node and uses only 
that node's processors; it does not show up on any of the other nodes it 
has been assigned.
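
For illustration, a minimal test job of the kind described above might 
look like the following sketch. It is not the actual job script: the 
resource request (nodes=2:ppn=1, a five-minute walltime) and the 
list_nodes helper are assumptions made up for the example. It relies only 
on the PBS_NODEFILE variable that torque sets inside a running job, and it 
prints the host the script is executing on next to the hosts allocated to 
the job.

```shell
#!/bin/sh
# Hypothetical resource request, for illustration only.
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:05:00

# Illustrative helper: print each distinct host named in a node file.
list_nodes() {
    sort -u "$1"
}

# PBS_NODEFILE is set by torque inside a job; outside a job it is unset.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Script is running on: $(hostname)"
    echo "Nodes allocated to this job:"
    list_nodes "$PBS_NODEFILE"
else
    echo "PBS_NODEFILE is not set; not running inside a torque job"
fi
```

Submitted with qsub, a script like this gives a quick way to compare the 
allocation with the single host where the script body itself executes.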

Does anyone have any idea what may be causing this behavior?

Here is a view of my qmgr configuration:
# Create queues and set their attributes.
# Create and define queue batch
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
# Set server attributes.
set server scheduling = True
set server acl_hosts =  <this is a valid hostname>
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server next_job_number = 7334

Chris Bright
