[torqueusers] Only a fraction of jobs are being run

Nima Mohammadi nima.irt at gmail.com
Mon Mar 19 10:56:12 MDT 2012


Hi folks,
It's New Year's Eve (Nowruz) in my country, and apparently I have the
entire cluster to myself over the holidays. So while everyone else is
celebrating, I submitted my job array of hundreds of jobs to the
queue. Unfortunately, only a fraction of them run simultaneously; the
rest sit in the queue waiting for the running jobs to complete.

[mohammadi at server mohammadi]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1845.server                ME3550.10.1.job  mohammadi       00:04:19 C batch
1847.server                ME3550.10.3.job  mohammadi       00:04:18 C batch
1848.server                ME3550.20.1.job  mohammadi       00:04:19 C batch
1850.server                ME3550.20.3.job  mohammadi       00:04:20 C batch
1851.server                ME3550.30.1.job  mohammadi              0 R batch
1852.server                ME3550.30.2.job  mohammadi              0 Q batch
1853.server                ME3550.30.3.job  mohammadi              0 Q batch
1854.server                ME3560.10.1.job  mohammadi              0 Q batch
....
1961.server                ME3770.30.3.job  mohammadi              0 R batch
1962.server                ME3780.10.1.job  mohammadi              0 Q batch
1963.server                ME3780.10.2.job  mohammadi              0 R batch
1964.server                ME3780.10.3.job  mohammadi              0 Q batch
....

At first I suspected that a slot limit had been set, but checking the
configuration with qmgr shows no max_slot_limit:

Qmgr: list server
Server server
    server_state = Active
    scheduling = True
    total_jobs = 213
    state_count = Transit:0 Queued:203 Held:0 Waiting:0 Running:10 Exiting:0
    acl_hosts = server
    acl_roots = root@*
    managers = cartoonist at server,ghader at server,mohammadi at server,
                   root at server
    operators = cartoonist at server,ghader at server,mohammadi at server,
                    root at server,seyedi at server
    default_queue = batch
    log_events = 511
    mail_from = adm
    resources_assigned.mem = 0b
    resources_assigned.nodect = 10
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 3.0.1
    keep_completed = 10
    next_job_number = 2070
    net_counter = 2 1 0
    record_job_info = True
    job_log_file_max_size = 10000
    job_log_file_roll_depth = 5
    job_log_keep_days = 10

Queue batch
	queue_type = Execution
	total_jobs = 213
	state_count = Transit:0 Queued:202 Held:0 Waiting:0 Running:10 Exiting:0
	resources_max.cput = 20000:00:00
	resources_min.cput = 00:00:01
	resources_default.cput = 10000:00:00
	resources_default.nodes = 2
	resources_default.walltime = 100:00:00
	mtime = Mon Mar 19 18:53:51 2012
	resources_assigned.mem = 0b
	resources_assigned.nodect = 10
	enabled = True
	started = True


Checking with the pbsnodes command shows 14 nodes up and running, with
246 processors in total. My workload is 'embarrassingly parallel':
there are no dependencies among the jobs.
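(For the record, the 246 came from summing the "np = N" lines in the
`pbsnodes -a` output; a small sketch of that count, where the two-node
sample below is illustrative and not my actual cluster:)

```python
from subprocess import Popen, PIPE

def count_processors(pbsnodes_output):
    """Sum the 'np = N' lines that `pbsnodes -a` prints, one per node."""
    total = 0
    for line in pbsnodes_output.splitlines():
        line = line.strip()
        if line.startswith('np = '):
            total += int(line.split('=')[1])
    return total

# Illustrative two-node sample of `pbsnodes -a` output (not real data).
sample = """node01
     state = free
     np = 16
     ntype = cluster

node02
     state = free
     np = 8
     ntype = cluster
"""
print(count_processors(sample))  # prints 24 for this sample

# Against the live server one would run something like:
#   out = Popen(['pbsnodes', '-a'], stdout=PIPE).communicate()[0]
#   print(count_processors(out))
```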
The batch scripts are generated using the Python script below:


from subprocess import call

# The shebang must be the very first line of the generated file; a
# leading blank line can make qsub ignore the #PBS directives.
script = '''#!/bin/sh
#PBS -l nodes=1:ppn=1,walltime=00:06:00
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -q batch
#PBS -M nima.irt@gmail.com
#PBS -m abe

source /share/mohammadi/nima/mydevenv/bin/activate
cd /share/mohammadi/nima/AI/
python ME-cluster.py %d %d %d %2.1f %2.1f
'''

total_experts = 3
for gating_hidden in xrange(5, 10):
    for experts_hidden in xrange(5, 10):
        for gating_N in [x * 0.1 for x in range(1, 4)]:
            for experts_N in [x * 0.1 for x in range(1, 4)]:
                params = (total_experts, gating_hidden, experts_hidden,
                          gating_N, experts_N)
                script_name = '/tmp/ME%d%d%d%2.1f%2.1f.job' % params
                with open(script_name, 'w') as scriptf:
                    scriptf.write(script % params)
                call(["qsub", script_name])
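Side note: since this is really one parameter sweep, a TORQUE job array
(`qsub -t 0-224` with a single script) might avoid the 225 separate qsub
calls, assuming the installed version supports `-t`. The
index-to-parameters mapping could look like the sketch below; the names
and the array-job framing are my own, not part of the script above.

```python
import os

# Hedged sketch: map a single array index (the PBS_ARRAYID value that
# TORQUE exports for `qsub -t` jobs) back to one parameter combination,
# so the whole 5*5*3*3 = 225-job sweep is one array submission.
def index_to_params(i):
    gating_hidden = 5 + i // 45            # 5..9, slowest-varying
    experts_hidden = 5 + (i // 9) % 5      # 5..9
    gating_N = 0.1 * (1 + (i // 3) % 3)    # 0.1, 0.2, 0.3
    experts_N = 0.1 * (1 + i % 3)          # 0.1, 0.2, 0.3, fastest-varying
    return gating_hidden, experts_hidden, gating_N, experts_N

if __name__ == '__main__':
    # After `qsub -t 0-224 sweep.job`, each array member would read its
    # index from the environment and recover its parameters:
    i = int(os.environ.get('PBS_ARRAYID', 0))
    print(index_to_params(i))
```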

Any help would be highly appreciated :)

-- Nima Mohammadi
