> Hi all,
> 	I have been experiencing a problem with a user submitting thousands
> of jobs. Out of most of the jobs they seem to either finish in a
> matter of seconds or aren't even doing anything. I'm using torque,
> maui and gold. Now I'm using a routing queue to contain the 10,000
> jobs they submit (all single cpu jobs). The routing queue works fine
> and routes to the proper execution queue (able to run 116 at a time).
> However, I notice as the system is chewing through the jobs trying to
> execute them they drop off so fast the system is having a hard time
> trying to keep up. The mysql server goes to 100% and even a load on
> goldd. I suspect it's because the flurry of jobs starting/stopping so
> fast that creating the reservations and other record-keeping in maui/
> gold is making this load.
> 	I'm hoping to get the user to make some changes to how they submit
> jobs (but they can be difficult at times). I suspect that even if the
> jobs ran for 5 minutes or so that then the system could at least keep
> up. So I'm curious to know if any others ran into this type of problem
> and what you did to solve it. Are there some changes in torque/maui/
> gold that I could make to help alleviate this?

I posted the exact same question on the gold list last year (but the archive 
at pnl.gov is gone and I could not find the thread on the clusterresources 

Unless you want to write your own layer between maui and gold, you're pretty 
much stuck.

We ended up limiting the number of idle and running jobs per user.  By 
default, each user is limited to 200 running jobs and 16 idle jobs.
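
In maui.cfg, limits of this kind are normally set with the MAXJOB (running 
jobs) and MAXIJOB (idle jobs) throttling parameters.  A minimal sketch, 
assuming the defaults are applied through USERCFG[DEFAULT]; this is an 
illustration with the numbers above, not a copy of our actual config:

```
# maui.cfg (sketch, adjust the values for your site)
# MAXJOB  = max running jobs per user, MAXIJOB = max idle (eligible) jobs
USERCFG[DEFAULT] MAXJOB=200 MAXIJOB=16
```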


Our policy is not to optimize the batch system for lots of small jobs.  By 
setting the above limits we more or less encourage our users to adjust their 
work setup.  Even if you bring down the accounting response time, the scaling 
will be limited: Amdahl's law will kick in eventually...
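
One way users can adjust their setup is to bundle many short tasks into a 
single job, so that maui and gold see one start/stop instead of hundreds.  A 
minimal sketch, assuming each task is a command that takes a task index as 
its last argument; run_chunk is a hypothetical helper, not part of torque, 
maui or gold:

```shell
#!/bin/sh
# run_chunk START COUNT CMD [ARGS...]
# Runs CMD once per task index in START .. START+COUNT-1, sequentially,
# inside a single batch job. (Hypothetical helper for illustration.)
run_chunk() {
    start=$1
    count=$2
    shift 2
    i=$start
    end=$((start + count))
    failed=0
    while [ "$i" -lt "$end" ]; do
        # Append the task index to the command; count failures but keep going
        "$@" "$i" || failed=$((failed + 1))
        i=$((i + 1))
    done
    if [ "$failed" -gt 0 ]; then
        echo "$failed of $count tasks failed" >&2
        return 1
    fi
    return 0
}
```

Inside a job script, something like "run_chunk 0 100 ./my_task" (where 
./my_task is a placeholder for the user's program) would then replace 100 
separate submissions with one job.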

If you're using postgres as the backend for gold, you should vacuum the 
database regularly.  The gold user has this in its crontab:

# su - gold
-bash-3.00$ crontab -l
00 04 * * * sh /opt/gold/vacuum.sh
-bash-3.00$ cat /opt/gold/vacuum.sh

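# Vacuum the database and refresh the planner statistics; makes it run faster.
# VACUUM ANALYZE already includes a plain VACUUM, so one statement is enough.
/usr/bin/psql -c "VACUUM ANALYZE;"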

Doing this brought the accounting response time down from 6 seconds to 
1 second.  (Our db server is really slow...)


