[torqueusers] Large cluster considerations
Jerry Smith
jdsmit at sandia.gov
Wed Feb 20 13:42:50 MST 2008
Andy,
qmgr -c "p s"
set server scheduling = True
set server managers = root@*
set server operators = root@admin2
set server default_queue = other
set server log_events = 511
set server mail_from =
set server query_other_jobs = True
set server scheduler_iteration = 90
set server node_ping_rate = 180
set server node_check_rate = 180
set server tcp_timeout = 240
set server job_stat_rate = 120
set server poll_jobs = True
set server log_level = 1
set server mail_domain =
set server pbs_version = 2.1.8
Qmgr: l s
Server admin2
server_state = Active
scheduling = True
total_jobs = 500
state_count = Transit:0 Queued:116 Held:0 Waiting:0 Running:384 Exiting:0
managers = root@*
operators = root@admin2
default_queue = other
log_events = 511
mail_from =
query_other_jobs = True
resources_assigned.nodect = 4175
scheduler_iteration = 90
node_ping_rate = 180
node_check_rate = 180
tcp_timeout = 240
job_stat_rate = 120
poll_jobs = True
log_level = 1
mail_domain =
pbs_version = 2.1.8
The above is from a 4480-node Torque/Moab cluster; roughly 420,000 jobs have
gone through the system.
On average the queue is about 500-600 jobs deep. The timeout tweaking
above is the best suggestion I could make.
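
The timeout-related settings from the listing above could be applied with
qmgr along these lines. This is only a minimal sketch: the values are the
ones from this cluster and may not suit another site, and the set_attr
helper and the DRY_RUN variable are made up for illustration. By default
the script just prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the timeout-related server attributes shown in the listing.
# DRY_RUN and set_attr are illustrative names, not part of Torque.
# DRY_RUN=1 (the default) prints the qmgr commands instead of running them;
# set DRY_RUN=0 on the pbs_server host, as a manager, to apply them.
DRY_RUN="${DRY_RUN:-1}"

set_attr() {
    cmd="set server $1 = $2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "qmgr -c \"$cmd\""
    else
        qmgr -c "$cmd"
    fi
}

# Values copied from the 4480-node cluster's configuration above.
set_attr node_ping_rate 180
set_attr node_check_rate 180
set_attr tcp_timeout 240
set_attr job_stat_rate 120
set_attr poll_jobs True
```

Running qmgr -c "p s" afterwards should show the new values, as in the
output at the top of this message.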
Moab/Maui tuning is another thing to think about at that scale.
We don't do any qstat caching or the like. The only noticeable lag we see
is when users do burst submissions, and then the delay is greater
with showq than with qstat/pbsnodes etc.
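
One rough way to compare that lag on another cluster is to time each
client command during a burst. The following is a sketch under
assumptions: the elapsed helper is a hypothetical name, timing is in whole
seconds only, and a command that is not installed is reported rather than
timed.

```shell
#!/bin/sh
# Sketch: print wall-clock seconds taken by each scheduler client command.
# "elapsed" is an illustrative helper; command output is discarded.
elapsed() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1: not found in PATH"
        return 0
    fi
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo "$1: $((end - start))s"
}

elapsed qstat
elapsed pbsnodes -a
elapsed showq
```

Running it once on a quiet system and once during a burst submission gives
a simple before/after comparison.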
Hope this helps.
Jerry
Caird, Andrew J wrote:
> Hello all,
>
> We periodically look at Appendix F of the Torque wiki, "Large Clusters
> Considerations"
> (http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:f_large_cluster_considerations)
> as our cluster grows.
>
> A while back Garrick mentioned something about never using --disable-rpp,
> but that practice is encouraged in Appendix F.
>
> Are there other things in that Appendix that are bad ideas?
> Particularly good ideas?
>
> What other things are people doing with large ( > 500 node; > 1000 node)
> clusters? What qualifies as a "large cluster" from Torque's
> perspective?
>
> As we grow, what should we be looking for? Is there an acceptable level
> of pbs_mom errors greater than zero?
>
> Thanks for any advice or discussion.
>
> --andy
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>