[torqueusers] Problem with huge number of jobs
Ronny T. Lampert
telecaadmin at uni.de
Tue Apr 25 11:14:38 MDT 2006
> I would like to use Torque to manage a cluster which has to run a huge number
> (~1000) of small jobs (~5min each).
> I would prefer to have one job on each node (or 2 since these are biproc) but
> if I set max_running=30 (approximate number of cpu), all the jobs are sent to
> the first ones and the problem is the same.
Wrong approach! The "nodes" file is the place to configure this. Use:
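Something like the following, assuming hypothetical hostnames node01 and node02 and dual-processor nodes (the original listing was not preserved here):

```
# $PBS_HOME/server_priv/nodes
node01 np=2
node02 np=2
```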
and so on, one entry per node. Then remove the max_running limit and restart pbs_server.
With np=2 you tell the server to start at most 2 jobs per node.
> Could you help me? Does anybody have experience with clusters dealing with a
> large number of jobs?
It's not uncommon in my setup to have 1000+ jobs queued. The server copes just
fine with it.
I'm not running the latest 2.1 snapshot but 2.0.0p8 at the moment (and a
couple of thousand jobs were not a problem with 1.2.X either).
That's my ./configure line:
CFLAGS="-O2 -fomit-frame-pointer -march=i686 -ffast-math" ./configure
--disable-filesync --enable-rpp <....>
The "disable-filesync" helps performance; I don't care about power loss
conditions, as I'm running journalled FS + an UPS.
And the relevant qmgr settings:
set server scheduler_iteration = 330
set server node_ping_rate = 180
set server node_check_rate = 300
set server tcp_timeout = 30
set server node_pack = False
set server job_stat_rate = 120
set server poll_jobs = True
set server default_node = 1#shared
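The settings above can be fed to qmgr in one pass via stdin; a minimal sketch, assuming qmgr is run on the server host with manager privileges:

```shell
# Apply the tuning settings in a single qmgr session.
qmgr <<'EOF'
set server scheduler_iteration = 330
set server node_ping_rate = 180
set server node_check_rate = 300
set server tcp_timeout = 30
set server node_pack = False
set server job_stat_rate = 120
set server poll_jobs = True
set server default_node = 1#shared
EOF
```

You can verify the result afterwards with `qmgr -c "print server"`.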
I had the problem of pbs_sched dying because a scheduling cycle took "too
long". I fixed this by starting it with "pbs_sched -a 600", which raises the
scheduler's alarm timeout to 600 seconds.
Now I'm running maui instead.
If you have further questions, just ask me.