[torqueusers] Large cluster considerations

Jerry Smith jdsmit at sandia.gov
Wed Feb 20 13:42:50 MST 2008


Andy,
qmgr -c "p s"

set server scheduling = True
set server managers = root@*
set server operators = root@admin2
set server default_queue = other
set server log_events = 511
set server mail_from =
set server query_other_jobs = True
set server scheduler_iteration = 90
set server node_ping_rate = 180
set server node_check_rate = 180
set server tcp_timeout = 240
set server job_stat_rate = 120
set server poll_jobs = True
set server log_level = 1
set server mail_domain =
set server pbs_version = 2.1.8

Qmgr: l s
Server admin2
        server_state = Active
        scheduling = True
        total_jobs = 500
        state_count = Transit:0 Queued:116 Held:0 Waiting:0 Running:384 Exiting:0
        managers = root@*
        operators = root@admin2
        default_queue = other
        log_events = 511
        mail_from =
        query_other_jobs = True
        resources_assigned.nodect = 4175
        scheduler_iteration = 90
        node_ping_rate = 180
        node_check_rate = 180
        tcp_timeout = 240
        job_stat_rate = 120
        poll_jobs = True
        log_level = 1
        mail_domain =
        pbs_version = 2.1.8




The above is for a 4480-node Torque/Moab cluster. About 420,000 jobs have 
gone through the system.

On average the queue is about 500-600 jobs deep.  The timeout tweaking 
above is the best suggestion I could make. 
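
As a rough sketch, the timing-related attributes from the listing above could
be applied with qmgr as shown below. Picking these four attributes is an
assumption made for the example; the listing does not say which values differ
from the defaults. DRY_RUN and the function names are hypothetical helpers,
and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: print (or apply) timing-related pbs_server attributes.
# Values are copied from the "p s" output above; DRY_RUN is a
# hypothetical switch for this example and defaults to 1 (print only).
DRY_RUN="${DRY_RUN:-1}"

apply_setting() {
    # $1 = server attribute name, $2 = value
    cmd="set server $1 = $2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "qmgr -c \"$cmd\""
    else
        qmgr -c "$cmd"
    fi
}

tune_timeouts() {
    apply_setting node_ping_rate 180
    apply_setting node_check_rate 180
    apply_setting tcp_timeout 240
    apply_setting job_stat_rate 120
}

tune_timeouts
```

Run it once as-is to review the commands, then with DRY_RUN=0 as a Torque
manager on the server host to apply them.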

Moab/Maui tuning is another thing to think about at that scale.

We don't do any qstat caching or the like. The only noticeable lag we see 
is when users do burst submissions, and then we see more of a delay 
with showq than with qstat, pbsnodes, etc.
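
One way to compare that lag is to time the client commands side by side.
This is only a sketch: time_cmd is a hypothetical helper, the resolution is
whole seconds, and it reports a command as not installed rather than failing
when run on a host without the Torque/Moab clients.

```shell
#!/bin/sh
# Sketch: rough wall-clock timing of the client commands mentioned above.
# time_cmd is a hypothetical helper name, not part of Torque or Moab.
time_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1: not installed"
        return 0
    fi
    start=$(date +%s)
    # Discard the command's output; only the elapsed time matters here.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo "$1: $((end - start))s"
}

time_cmd qstat
time_cmd pbsnodes -a
time_cmd showq
```

Running it during a burst submission and again on a quiet queue gives a
crude before/after comparison.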

Hope this helps.

Jerry

Caird, Andrew J wrote:
> Hello all,
>
> We periodically look at Appendix F of the Torque wiki, "Large Clusters
> Considerations"
> (http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:f_large_cluster_considerations)
> as our cluster grows.
>
> A while back Garrick mentioned something about never using --disable-rpp,
> but that practice is encouraged in Appendix F.
>
> Are there other things in that Appendix that are bad ideas?
> Particularly good ideas?
>
> What other things are people doing with large ( > 500 node; > 1000 node)
> clusters?  What qualifies as a "large cluster" from Torque's
> perspective?
>
> As we grow, what should we be looking for?  Is there an acceptable level
> of pbs_mom errors greater than zero?
>
> Thanks for any advice or discussion.
>
> --andy