[torqueusers] Torque on 1000 nodes ?

Ronny T. Lampert telecaadmin at uni.de
Fri Jul 1 02:08:16 MDT 2005


Hi,

>> We're considering whether to move our 900+ node Linux cluster to
>> the Torque resource manager.  However, we're unsure if Torque
>> [...]
>> 2. What special tweaking must be done on large clusters?
If you expect a large number of jobs, I can recommend using reiserfs for
the volume that holds your $PBS_HOME (or for the whole server, if you
like).
reiserfs copes with very many files without noticeable slowdown, while
ext2/ext3 perform badly here. I sometimes had as many as 16,000 jobs
queued (and I only have 13 nodes).
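If you're unsure which filesystem your spool directory sits on, a quick check along these lines works (the /var/spool/pbs default is an assumption; substitute your actual $PBS_HOME):

```shell
# Report the filesystem type backing the PBS spool directory.
# /var/spool/pbs is an assumed default -- adjust to your install.
pbs_home=${PBS_HOME:-/var/spool/pbs}
[ -d "$pbs_home" ] || pbs_home=/   # fall back so the check still runs
fstype=$(stat -f -c '%T' "$pbs_home")
echo "$pbs_home is on: $fstype"
```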

Also, don't forget to build Torque WITHOUT fsync()ing every job! The
configure option is: --disable-filesync
Otherwise you will see large delays when the queue is actively used
(qsub'ing/qstat'ing).
I also run my RAID in write-back buffering mode (server hardware with UPS,
so I can risk it).

Secondly, you may run into time-wait bucket overflows if you use lots of
TCP connections. This is not critical per se, but nobody likes kernel
warning messages :)
Newer kernels raise this value automatically, so check what yours has (cat
/proc/sys/net/ipv4/tcp_max_tw_buckets).
Put something like this in your pbs_server's /etc/sysctl.conf:
net.ipv4.tcp_max_tw_buckets = 16384
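To see where you stand before changing anything, a small check like this can be used (guarded, since very old kernels may not expose the knob):

```shell
# Read the current time-wait bucket limit from /proc, if present.
f=/proc/sys/net/ipv4/tcp_max_tw_buckets
if [ -r "$f" ]; then
  current=$(cat "$f")
  echo "current tcp_max_tw_buckets: $current"
else
  echo "tcp_max_tw_buckets not exposed on this kernel"
fi
# After editing /etc/sysctl.conf, apply it with: sysctl -p
```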

--enable-rpp should also be used, so you'll use UDP (if you have a local and
reliable network).
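Putting the two build options together, a configure run would look roughly like this (the prefix is just an example; check ./configure --help for your Torque version, as option names can change between releases):

```shell
# Hypothetical build sketch for the flags discussed above:
#   --disable-filesync  skips fsync() on every job file write
#   --enable-rpp        uses UDP-based RPP for inter-daemon traffic
./configure --prefix=/usr/local --disable-filesync --enable-rpp
make
make install   # as root
```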

I sometimes had problems with pbs_sched quitting because a scheduling
cycle took "too long". These problems seem to be gone since 1.2. I still
run it with an alarm time of 10 minutes:

pbs_sched -a 600

That's all from me. Have phun.
Cheers,

Ronny

