Miles O'Neal
Thu Jan 10 12:52:33 MST 2008

vaibhav agrawal said...

|I have torque resource manager(ver. 2.1.6) with 100 FreeBSD boxes and using
|maui(3.2.6p16) as scheduler.
|The cluster as a whole behaves very well, when the number of jobs submitted
|is less than 8000 and it executes around 300 jobs as a whole which is as per
|configuration. But when the number of jobs exceeds 8000, it behaves in a
|very strange way. The number of jobs which were getting scheduled decreases
|to 100, sometimes 50, or sometimes no jobs are scheduled. The cluster goes into
|hung state and the jobs remain queued for a long time(usually 45 mins to an

We try to keep ours under 2500.  Many of our jobs are small,
so we batch them, but even so, we find that at some point
maui gets so busy scheduling that it's useless.  The pain point
depends on your config and all network data, so our pain
point will be different than yours.  We have no easy way to
test outside our production environment, so we don't know
what the pain point is any more.  It used to be below 1500.
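
To illustrate batching small jobs so the scheduler sees fewer
queue entries, here is a minimal sketch in Python.  It is not the
tooling described above.  The function names, the "batch" prefix
and the "./task" commands are hypothetical; the only torque
feature assumed is the "#PBS -N" job-name directive.  Each
generated script would still have to be written to a file and
submitted with qsub.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_batch_script(commands, name):
    """Return the text of one shell script that runs `commands` in order."""
    # "#PBS -N" sets the job name; every other line is one small task.
    lines = ["#!/bin/sh", f"#PBS -N {name}"]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


def build_batches(commands, per_job, prefix="batch"):
    """Group many small commands into fewer job scripts (hypothetical helper)."""
    return [
        make_batch_script(chunk, f"{prefix}{i:04d}")
        for i, chunk in enumerate(chunked(commands, per_job))
    ]


if __name__ == "__main__":
    # Hypothetical workload: ten small tasks packed four to a job.
    tasks = [f"./task {i}" for i in range(10)]
    for script in build_batches(tasks, per_job=4):
        print(script)
```

With four tasks per job, ten small tasks become three queue
entries instead of ten.  The right batch size depends on how
long each task runs and how much queue depth your scheduler
tolerates.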

We did quite a few things to get here.  We tweaked NFS to
death, and we tweaked the Linux kernel parameters as well
as maui and torque.

I'll try to find a summary of what we did and post it here.

Miles O'Neal
NSA Manager
Intrinsity, Inc.
meo at intrinsity.com

