[torqueusers] submitted jobs not running on all nodes
cbright at sci.utah.edu
Wed Jun 5 13:25:22 MDT 2013
I'm running two clusters, one with 64 nodes and one with 5 nodes, both
using the default torque package for scheduling and everything else. When
I submit a job that should use more than one node, it does not run on all
of the nodes but stays on one. Running tracejob <job-id> or
qstat -f <job-id> shows that the nodes have been allocated to the job, and
everything appears to be fine. On each allocated node the appropriate job
files are in the mom directory, but top or ps -ef shows the job on only
one node, using only that node's processors; it does not show up on any
of the other nodes it has been assigned.
Does anyone have any idea what may be causing this behavior?
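One way to narrow this down is a small test job that reports which hosts were allocated and where commands actually start. Torque runs the job script itself only on the first allocated node; anything on the other nodes has to be launched from the script (for example by an MPI launcher or pbsdsh). The following is only a sketch: the resource request (nodes=2:ppn=2), the helper name summarize_nodefile, and the /bin/hostname path are assumptions chosen for illustration, and it assumes the pbsdsh shipped with Torque supports the -u (once per node) option.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:05:00

# Hypothetical helper: list each distinct host in a nodefile together
# with the number of slots (lines) it has in that file.
summarize_nodefile() {
    sort "$1" | uniq -c
}

# Only act inside a running job, where Torque sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Job script itself is running on: $(hostname)"
    echo "Hosts allocated to this job:"
    summarize_nodefile "$PBS_NODEFILE"
    # Ask Torque to start a command once on every allocated node.
    echo "Hosts reached through pbsdsh:"
    pbsdsh -u /bin/hostname
fi
```

If the nodefile lists every expected host and pbsdsh reaches them, the allocation is working and the application is simply not being launched across nodes; if pbsdsh only reaches one host, the problem is more likely in the mom setup.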
Here is a view of my qmgr configuration:
# Create queues and set their attributes.
# Create and define queue batch
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
# Set server attributes.
set server scheduling = True
set server acl_hosts = <this is a valid hostname>
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server next_job_number = 7334