[torqueusers] Problem with huge number of jobs
mherquet at fyma.ucl.ac.be
Tue Apr 25 09:09:15 MDT 2006
I would like to use Torque to manage a cluster which has to run a huge number
(~1000) of small jobs (~5 min each).
If I use the following config for the server:
mail_from = adm
scheduler_iteration = 60
node_check_rate = 150
tcp_timeout = 6
pbs_version = 2.1.0p0-snap.200603201347
and the following one for the queue:
resources_default.nodes = 1
resources_default.walltime = 01:00:00
enabled = True
started = True
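(For reference, these attributes were set with qmgr along the following lines; the queue name "short" below is just a placeholder, not the real queue name:)

```shell
# Placeholder sketch of the qmgr commands used to produce the
# server/queue settings listed above ("short" is a made-up queue name)
qmgr -c "set server scheduler_iteration = 60"
qmgr -c "set server node_check_rate = 150"
qmgr -c "set server tcp_timeout = 6"
qmgr -c "set queue short resources_default.nodes = 1"
qmgr -c "set queue short resources_default.walltime = 01:00:00"
qmgr -c "set queue short enabled = True"
qmgr -c "set queue short started = True"
```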
The processes are almost all sent to the nodes at the same time (around 10 per
node). Everything seems OK at first, but after some time, on certain nodes, the
jobs switch to the exiting state but never exit, and the node status becomes
"down,job-exclusive". These nodes no longer accept new jobs, and the overall
throughput is dramatically slowed down until it simply stops (all jobs stuck in
the queued state).
I would prefer to have one job on each node (or two, since these are
dual-processor machines), but if I set max_running=30 (the approximate number of
CPUs), all the jobs are sent to the first nodes and the problem is the same.
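In case it helps, the distribution I am aiming for would look roughly like this (node names and the job script name are just examples):

```shell
# Sketch of the intended setup, not a verified fix:
# declare each dual-processor node with np=2 in the server's nodes file,
# e.g. in $PBS_HOME/server_priv/nodes:
#   node01 np=2
#   node02 np=2
# then submit each small job asking for one processor on one node
# ("myjob.sh" is a placeholder script name):
qsub -l nodes=1:ppn=1 -l walltime=00:05:00 myjob.sh
```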
Could you help me? Does anybody have experience with clusters dealing with a
large number of jobs?
Thanks in advance!