[torqueusers] Jobs limited to one per node
knielson at adaptivecomputing.com
Thu Jun 7 10:00:04 MDT 2012
On Tue, May 15, 2012 at 9:12 AM, Josh Nielsen <jnielsen at hudsonalpha.com> wrote:
> I noticed recently on our Torque cluster (3.0.2) that it is only allowing
> one job per node and it is only assigning one CPU core for each job even
> though there are eight per node (so it is not maxing out the resources -
> and is wasting/not utilizing seven cores per node). After looking around
> for a while I found a comment elsewhere on this mailing list about
> compiling torque with the --enable-cpuset flag. I read the Torque page
> about cpusets but am none the wiser about whether that is a required
> feature to allow, what I would have thought to be default functionality of
> allowing, more than one process/job to run on a node (and to utilize more
> than one core per job).
Torque is treating npp as a node feature (property), and none of your nodes
has a feature by that name, so no node can satisfy the request. The resource
you actually want is ppn (processors per node).
echo "sleep 60; echo test" | qsub -l nodes=1:ppn=1
This should fix your problem.
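To use more than one core per job, raise the ppn value, up to the np count
declared for the node in server_priv/nodes (np=8 in your case). Below is a
minimal sketch of how the request string is put together. The nodes_spec
helper is a hypothetical name used only for illustration and is not part of
Torque, and the script prints the qsub commands instead of running them:

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): build a node request string.
# Usage: nodes_spec NODES PPN  ->  prints "nodes=NODES:ppn=PPN"
nodes_spec() {
    printf 'nodes=%s:ppn=%s\n' "$1" "$2"
}

# Print the submission commands rather than executing them, so the
# sketch can be tried on a machine without Torque installed.
echo "qsub -l $(nodes_spec 1 1)"   # one core on one node
echo "qsub -l $(nodes_spec 1 8)"   # all eight cores on one node
```

With ppn=1, up to eight such jobs should be able to share one eight-core
node, assuming the scheduler in use places jobs per core.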
> If I specify any npp=* value with qsub, even if only one (like echo "sleep
> 60; echo test" | qsub -l nodes=1:npp=1), I get the message "qsub: Job
> exceeds queue resource limits MSG=cannot locate feasible nodes". And during
> the course of scheduling jobs, once there are more jobs requested than
> there are nodes then they are listed as queued and in the sched_log/ log
> files I see "Not enough of the right type of nodes available" for each new
> request. I also tried adding np=8 to each of the nodes listed in
> server_priv/nodes since I had not before, but it did not change anything.
> I will post my Torque config below, but I'm curious to know if
> --enable-cpuset is what I need, since it is not made explicit that it is a
> required flag to allow more than one job to run per node. Setting the
> default and max settings was my attempt to get this working, although we
> have another cluster that doesn't specify any of that and it runs as
> expected by reserving the amount of cpus per node that you request with npp.
> qmgr -c "print server"
> # Create queues and set their attributes.
> # Create and define queue batch
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_max.ncpus = 8
> set queue batch resources_max.nodes = 2
> set queue batch resources_min.ncpus = 1
> set queue batch resources_default.ncpus = 1
> set queue batch resources_default.nodect = 1
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 32:00:00
> set queue batch enabled = True
> set queue batch started = True
> # Set server attributes.
> set server scheduling = True
> set server acl_hosts = penguin-head01.compute.haib.org
> set server managers = root@penguin-head01.compute.haib.org
> set server operators = root@penguin-head01.compute.haib.org
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 554
> qmgr -c "list server"
> Server penguin-head01.compute.haib.org
> server_state = Active
> scheduling = True
> total_jobs = 0
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
> acl_hosts = penguin-head01.compute.haib.org
> managers = root@penguin-head01.compute.haib.org
> operators = root@penguin-head01.compute.haib.org
> default_queue = batch
> log_events = 511
> mail_from = adm
> resources_assigned.ncpus = 0
> resources_assigned.nodect = 0
> scheduler_iteration = 600
> node_check_rate = 150
> tcp_timeout = 6
> mom_job_sync = True
> pbs_version = 3.0.2
> keep_completed = 300
> next_job_number = 554
> net_counter = 2 0 0
> Any suggestions would be appreciated!