[torqueusers] torque not scaling well

Miles O'Neal meo at intrinsity.com
Tue Jul 31 17:44:12 MDT 2007


We switched from pbs, with the server and
scheduler running on Solaris 7 on a 300MHz
Sparc, to torque/maui running on CentOS 4.4 on
a 2.16GHz Core 2 Duo.  But we still start
seeing slowdowns in job run rates somewhere
between 1000 and 1500 jobs queued, and once
we get up around 3K jobs queued, forget it.
We have several users who might run scripts
doing multiple qsubs simultaneously, and
this makes it worse.  We currently have about
250 client nodes, but during the daytime,
only about 180 are available for running
torque jobs.  These are a mix of systems, from
a few old 1.2GHz Athlons up, but the bulk of
the systems are similar to the server.  Those
systems have 1Gb connections everywhere;
some of the older systems are 100Mb ethernet.

General configs:

Qmgr: l s torque
Server torque.farm.intrinsity.com
        server_state = Scheduling
        scheduling = True
        total_jobs = 3329
        state_count = Transit:0 Queued:3269 Held:0 Waiting:0 Running:59 Exiting:1 
        acl_host_enable = False
        acl_roots = root
        managers = [...]
        default_queue = linux
        log_events = 256
        mail_from = adm
        query_other_jobs = True
        resources_default.mem = 900mb
        resources_default.ncpus = 1
        resources_default.nodes = 1
        resources_default.walltime = 8760:00:00
        resources_assigned.mem = 102939295744b
        resources_assigned.ncpus = 59
        resources_assigned.nodect = 59
        scheduler_iteration = 60
        node_check_rate = 150
        tcp_timeout = 6
        job_stat_rate = 120
        poll_jobs = True
        log_level = 9
        pbs_version = 2.1.8
        allow_node_submit = True
        server_name = [...]
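
For anyone wanting to track the backlog over time, here is a rough
sketch (not part of torque) that pulls the per-state counts out of the
state_count line shown above.  The function name parse_state_count is
made up for illustration, and the line format is assumed to be exactly
what qmgr printed in this listing.

```python
# Sketch only: parse the "state_count = ..." line as it appears in the
# qmgr listing. parse_state_count is a hypothetical helper name.
def parse_state_count(line):
    """Return a dict such as {"Queued": 3269, ...} from a state_count line."""
    _, _, value = line.partition("=")
    counts = {}
    for field in value.split():
        # Each field is assumed to look like "Name:number".
        name, _, number = field.partition(":")
        counts[name] = int(number)
    return counts


# Line copied from the listing (leading whitespace removed).
line = "state_count = Transit:0 Queued:3269 Held:0 Waiting:0 Running:59 Exiting:1"
counts = parse_state_count(line)
total = sum(counts.values())
print(counts["Queued"], counts["Running"], total)  # prints: 3269 59 3329
```

The sum of the states (3329) agrees with the total_jobs value in the
same listing.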

cat maui.cfg
SERVERHOST            torque.farm.intrinsity.com
ADMIN1                root coers brian tomr meo ballgyer
RMCFG[torque.farm.intrinsity.com]  TYPE=PBS TIMEOUT=90
SERVERMODE            NORMAL
LOGFILE               maui.log
LOGFILEMAXSIZE        100000000
LOGLEVEL              0
NODEALLOCATIONPOLICY  FIRSTAVAILABLE
DEFERTIME 0
QUEUETIMEWEIGHT 0
CLASSWEIGHT 1
JOBAGGREGATIONTIME 1
BACKFILLPOLICY        FIRSTFIT
NODECFG[GLOBAL] GRES=hsimplus:1

qstat -q

server: torque.farm.intrinsity.com

Queue            Memory CPU Time Walltime Node    Run   Que   Lm  State
---------------- ------ -------- -------- ----  ----- ----- ----  -----
liberate           --      --       --      --      0     0    1   E R
vcs                --      --       --      --      2     0  110   E R
regress            --      --       --      --      0   170   --   E R
assura_post        --      --       --      --      0     0   50   E R
dv_priority        --      --       --      --      0  2539  150   D R
mrunas             --      --       --      --      0     0   --   E R
dft                --      --       --      --      3     0   --   E R
nsa                --      --       --      --      0     0   --   E R
linux              --      --       --      --     26     0   --   E R
routing            --      --       --      --      3     0    6   E R
hsim               --      --       --      --      1     0    4   E R
lvs                --      --       --      --      0     0    3   E R
assura             --      --       --      --      0     0    2   E R
BigSolaris         --      --       --      --      0     0   --   E R
cplace             --      --       --      --      0     0   --   E R
pathmill           --      --       --      --      0     0    4   E R
dcshell            --      --       --      --      0     0    1   E R
BigLinux           --      --       --      --      3     0   --   E R
am_test            --      --       --      --      0     0   --   E R
ncverilog          --      --       --      --      0     0    5   E R
tpr                --      --       --      --      0     0   --   E R
sizing_no_limit    --      --       --      --      0     0   --   E R
pacific            --      --       --      --      0     0    1   E R
calibre            --      --       --      --      0     0    3   E R
eg                 --      --       --      --      1     0   --   E R
sizing             --      --       --      --      6     1    8   E R
mrunas_priority    --      --       --      --      0     0   10   E R
performance        --      --       --      --      4     0   20   E R
smoke              --      --       --      --     17   205   --   E R
cell_build         --      --       --      --      0   146   --   E R
                                               ----- -----
                                                  66  3061
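
To see at a glance which queues hold the backlog, something like the
following sketch could be run against the qstat -q output.  It assumes
the ten-column layout shown above; parse_qstat_q is a hypothetical
helper name, and the sample text is a subset of rows copied from the
table (all the queues with queued jobs, plus linux).

```python
# Sketch only: pick out per-queue Run/Que counts from "qstat -q" text,
# assuming the column layout shown in the listing above.
def parse_qstat_q(text):
    """Return one dict per queue row; header, rule and total lines are skipped."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        # A queue row has 10 whitespace-separated fields, with numeric
        # Run and Que values in positions 5 and 6.
        if len(f) != 10 or not (f[5].isdigit() and f[6].isdigit()):
            continue
        rows.append({
            "queue": f[0],
            "run": int(f[5]),
            "que": int(f[6]),
            "limit": f[7],              # "--" means no limit is shown
            "state": f[8] + " " + f[9],  # kept verbatim, e.g. "E R"
        })
    return rows


sample = """\
Queue            Memory CPU Time Walltime Node    Run   Que   Lm  State
---------------- ------ -------- -------- ----  ----- ----- ----  -----
regress            --      --       --      --      0   170   --   E R
dv_priority        --      --       --      --      0  2539  150   D R
linux              --      --       --      --     26     0   --   E R
sizing             --      --       --      --      6     1    8   E R
smoke              --      --       --      --     17   205   --   E R
cell_build         --      --       --      --      0   146   --   E R
"""

rows = parse_qstat_q(sample)
# Queues with anything queued, largest backlog first.
backlog = sorted((r for r in rows if r["que"] > 0),
                 key=lambda r: r["que"], reverse=True)
for r in backlog:
    print(f"{r['queue']:<16} queued={r['que']:>5} running={r['run']:>3} "
          f"limit={r['limit']} state={r['state']}")
```

On the rows above this lists dv_priority first, followed by smoke,
regress, cell_build and sizing.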


Any ideas?

Thanks,
Miles

