[torqueusers] Torque on 1000 nodes ?

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 1 07:42:22 MDT 2005


Ronny T. Lampert wrote:
> Also don't forget to build torque WITHOUT fsync()ing every job!
> configure-option is: --disable-filesync
> Else you will have large delays when the queue is actively used
> (qsub'ing/qstat'ing).

OK, that's documented at the end of this document:
http://www.clusterresources.com/products/torque/docs/1.0.1buildoverview.shtml
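For reference, a build along those lines might look like this (source
directory and version are illustrative, not taken from the original mail):

```shell
# Hypothetical build sketch: disable the per-job fsync() to avoid
# qsub/qstat stalls when the queue is busy (flag documented at the
# URL above).
cd torque-1.2.0            # example source tree; use your actual version
./configure --disable-filesync
make
make install               # typically run as root
```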

> Secondly, you may run into time-wait bucket overflows, if you are using lots
> of tcp connections. This is not critical per se, but we all don't like
> kernel warning messages :)
> Newer kernels have this value raised automatically, so you may want to
> check it (cat /proc/sys/net/ipv4/tcp_max_tw_buckets).
> Put something like this in your pbs_server's /etc/sysctl.conf:
> net.ipv4.tcp_max_tw_buckets = 16384

The default value on my Red Hat Enterprise Linux 4.0 server is 180000.
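To check the limit and raise it without a reboot, something like the
following should work on Linux (the value 180000 is just an example):

```shell
# Show the current limit on TIME_WAIT buckets:
cat /proc/sys/net/ipv4/tcp_max_tw_buckets

# Raise it for the running kernel:
sysctl -w net.ipv4.tcp_max_tw_buckets=180000

# Or make it persistent across reboots, then reload sysctl.conf:
echo 'net.ipv4.tcp_max_tw_buckets = 180000' >> /etc/sysctl.conf
sysctl -p
```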

> --enable-rpp should also be used, so you'll use UDP (if you have a local and
> reliable network).

Your suggestion disagrees with the official advice in
http://www.clusterresources.com/products/torque/docs/3.4largesystems.shtml
which says: "For clusters larger than 300 nodes, it is recommended that
TORQUE be built with the '--disable-rpp' flag passed to configure".
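Following that advice, the configure line for a large cluster would
presumably look like this (illustrative only, combining the flags
discussed above):

```shell
# Hypothetical configure sketch for clusters larger than 300 nodes,
# per the ClusterResources docs cited above: use TCP instead of
# RPP/UDP, and skip the per-job fsync().
./configure --disable-rpp --disable-filesync
```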

> I sometimes had problems with pbs_sched quitting because of "too long".
> These problems seem to be gone since 1.2. I still run it with an alarm
> time of 10 minutes.
> 
> pbs_sched -a 600

Well, we run the Maui scheduler, so that's a bit different for us.
Some further advice was given in
http://www.supercluster.org/pipermail/torqueusers/2005-June/001638.html

Regards,
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

