[torqueusers] Problem with huge number of jobs

Ronny T. Lampert telecaadmin at uni.de
Tue Apr 25 11:14:38 MDT 2006


Hi,

> I would like to use Torque to manage a cluster which has to run a huge number 
> (~1000) of small jobs (~5min each). 
[...]
> I would prefer to have one job on each node (or 2 since these are biproc) but 
> if I set max_running=30 (approximate number of cpu), all the jobs are sent to 
> the first ones and the problem is the same.

Wrong approach! The "nodes" file is the place to configure this. Use:

node1 np=2
node2 np=2

and so on. Remove the max_running limit and restart pbs_server.
This tells the server to start at most 2 jobs per node.
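
For more than a handful of nodes, the file can be generated instead of typed.
A minimal sketch, assuming the hosts are named node1 ... nodeN as in the
example above; the function name gen_nodes and the count of 15 in the usage
comment are illustrative, and the nodes file location depends on your install:

```shell
#!/bin/sh
# gen_nodes COUNT NP: print COUNT nodes-file lines, one per host.
# Host names follow the nodeN pattern (an assumption; use your real names).
gen_nodes() {
    count=$1
    np=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "node$i np=$np"
        i=$((i + 1))
    done
}

# Usage (hypothetical: 15 dual-CPU nodes):
#   gen_nodes 15 2 > nodes
# then copy "nodes" into the server's server_priv directory
# and restart pbs_server.
```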

> Could you help me ? Is anybody has some experience with clusters dealing with 
> large number of jobs ?

It's not uncommon in my setup to have 1000+ jobs queued. The server copes
with that just fine.
I'm not running the latest 2.1 snapshot, but 2.0.0p8 at the moment (and a
couple of thousand jobs were not a problem with 1.2.X either).

That's my ./configure line:

CFLAGS="-O2 -fomit-frame-pointer -march=i686 -ffast-math" ./configure \
    --disable-filesync --enable-rpp <....>

The "--disable-filesync" option helps performance; I don't care about power
loss conditions, as I'm running a journalled FS plus a UPS.

And the relevant qmgr settings:

set server scheduler_iteration = 330
set server node_ping_rate = 180
set server node_check_rate = 300
set server tcp_timeout = 30
set server node_pack = False
set server job_stat_rate = 120
set server poll_jobs = True
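
These seven settings can be kept in a small script and fed to qmgr in one go,
since qmgr reads directives from standard input. A rough sketch, not a tested
procedure; the function name server_settings is made up for illustration, and
it has to be run by a user with manager rights on the pbs_server host:

```shell
#!/bin/sh
# server_settings: print the seven qmgr directives listed above, one per line.
server_settings() {
    cat <<'EOF'
set server scheduler_iteration = 330
set server node_ping_rate = 180
set server node_check_rate = 300
set server tcp_timeout = 30
set server node_pack = False
set server job_stat_rate = 120
set server poll_jobs = True
EOF
}

# Apply them (on the pbs_server host):
#   server_settings | qmgr
# Check the result afterwards:
#   qmgr -c 'print server'
```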

Don't forget:
set server default_node = 1#shared

I had the problem of pbs_sched dying because a scheduling run took "too long".
I fixed this by starting it with "pbs_sched -a 600", which raises the alarm
time to 600 seconds.
Now I'm running maui instead.

If you have further questions, just ask me.

Cheers,
Ronny

