[torqueusers] torque not scaling well
Miles O'Neal
meo at intrinsity.com
Tue Jul 31 17:44:12 MDT 2007
We switched from PBS, with the server and
scheduler running on Solaris 7 on a 300MHz
Sparc, to torque/maui running on CentOS 4.4 on
a 2.16GHz Core 2 Duo. But we still start
seeing slowdowns in job run rates somewhere
between 1000 and 1500 jobs queued, and once
we get up around 3K jobs queued, forget it.
We have several users who might run scripts
doing multiple qsubs simultaneously, and
this makes it worse. We currently have about
250 client nodes, but during the daytime,
only about 180 are available for running
torque jobs. These are a mix of systems, from
a few old 1.2GHz Athlons up, but the bulk of
the systems are similar to the server. Those
systems have 1Gb connections everywhere;
some of the older systems are on 100Mb Ethernet.
General configs:
Qmgr: l s torque
Server torque.farm.intrinsity.com
server_state = Scheduling
scheduling = True
total_jobs = 3329
state_count = Transit:0 Queued:3269 Held:0 Waiting:0 Running:59 Exiting:1
acl_host_enable = False
acl_roots = root
managers = [...]
default_queue = linux
log_events = 256
mail_from = adm
query_other_jobs = True
resources_default.mem = 900mb
resources_default.ncpus = 1
resources_default.nodes = 1
resources_default.walltime = 8760:00:00
resources_assigned.mem = 102939295744b
resources_assigned.ncpus = 59
resources_assigned.nodect = 59
scheduler_iteration = 60
node_check_rate = 150
tcp_timeout = 6
job_stat_rate = 120
poll_jobs = True
log_level = 9
pbs_version = 2.1.8
allow_node_submit = True
server_name = [...]
cat maui.cfg
SERVERHOST torque.farm.intrinsity.com
ADMIN1 root coers brian tomr meo ballgyer
RMCFG[torque.farm.intrinsity.com] TYPE=PBS TIMEOUT=90
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 100000000
LOGLEVEL 0
NODEALLOCATIONPOLICY FIRSTAVAILABLE
DEFERTIME 0
QUEUETIMEWEIGHT 0
CLASSWEIGHT 1
JOBAGGREGATIONTIME 1
BACKFILLPOLICY FIRSTFIT
NODECFG[GLOBAL] GRES=hsimplus:1
qstat -q
server: torque.farm.intrinsity.com
Queue            Memory CPU Time Walltime Node   Run   Que   Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
liberate             --       --       --   --     0     0    1 E R
vcs                  --       --       --   --     2     0  110 E R
regress              --       --       --   --     0   170   -- E R
assura_post          --       --       --   --     0     0   50 E R
dv_priority          --       --       --   --     0  2539  150 D R
mrunas               --       --       --   --     0     0   -- E R
dft                  --       --       --   --     3     0   -- E R
nsa                  --       --       --   --     0     0   -- E R
linux                --       --       --   --    26     0   -- E R
routing              --       --       --   --     3     0    6 E R
hsim                 --       --       --   --     1     0    4 E R
lvs                  --       --       --   --     0     0    3 E R
assura               --       --       --   --     0     0    2 E R
BigSolaris           --       --       --   --     0     0   -- E R
cplace               --       --       --   --     0     0   -- E R
pathmill             --       --       --   --     0     0    4 E R
dcshell              --       --       --   --     0     0    1 E R
BigLinux             --       --       --   --     3     0   -- E R
am_test              --       --       --   --     0     0   -- E R
ncverilog            --       --       --   --     0     0    5 E R
tpr                  --       --       --   --     0     0   -- E R
sizing_no_limit      --       --       --   --     0     0   -- E R
pacific              --       --       --   --     0     0    1 E R
calibre              --       --       --   --     0     0    3 E R
eg                   --       --       --   --     1     0   -- E R
sizing               --       --       --   --     6     1    8 E R
mrunas_priority      --       --       --   --     0     0   10 E R
performance          --       --       --   --     4     0   20 E R
smoke                --       --       --   --    17   205   -- E R
cell_build           --       --       --   --     0   146   -- E R
                                               ----- -----
                                                  66  3061
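A rough way to see where the backlog sits is to tally the Run and Que
columns per queue. The following is only a minimal Python sketch, not
part of torque or maui: the function name parse_qstat_q and the SAMPLE
text (five rows copied from the listing above) are illustrative, and it
assumes every data row has the ten whitespace-separated fields shown
in that listing.

```python
# Hypothetical helper: tally running/queued jobs per queue from `qstat -q` text.
# SAMPLE holds five rows copied from the listing; it is not the full output.
SAMPLE = """\
regress -- -- -- -- 0 170 -- E R
dv_priority -- -- -- -- 0 2539 150 D R
linux -- -- -- -- 26 0 -- E R
smoke -- -- -- -- 17 205 -- E R
cell_build -- -- -- -- 0 146 -- E R
"""


def parse_qstat_q(text):
    """Return {queue_name: (running, queued)} for each data row in text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed row layout: name, four limit columns, Run, Que, Lm,
        # then the two state letters (ten fields in total).
        if len(fields) != 10:
            continue
        run, que = fields[5], fields[6]
        # Header and separator lines have no integers in these positions.
        if not (run.isdigit() and que.isdigit()):
            continue
        counts[fields[0]] = (int(run), int(que))
    return counts


counts = parse_qstat_q(SAMPLE)
total_queued = sum(que for _, que in counts.values())

# Print queues ordered by how many jobs they have waiting.
for name, (run, que) in sorted(counts.items(),
                               key=lambda item: item[1][1], reverse=True):
    share = 100.0 * que / total_queued if total_queued else 0.0
    print(f"{name:16s} run={run:3d} queued={que:5d} ({share:.0f}% of sample)")
```

In practice the text would come from the saved output of qstat -q
rather than an embedded string; the parsing step stays the same.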
Any ideas?
Thanks,
Miles