[torqueusers] Problem with huge number of jobs

Michel Herquet mherquet at fyma.ucl.ac.be
Tue Apr 25 09:09:15 MDT 2006


Hi,

I would like to use Torque to manage a cluster which has to run a huge number 
(~1000) of small jobs (~5min each). 

I use the following config for the server:

        mail_from = adm
        scheduler_iteration = 60
        node_check_rate = 150
        tcp_timeout = 6
        pbs_version = 2.1.0p0-snap.200603201347

and the following one for the queue:
       
        resources_default.nodes = 1
        resources_default.walltime = 01:00:00
        enabled = True
        started = True
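
For reference, here is a minimal sketch of how the settings listed above could
be written out as qmgr commands. The queue name "batch" is a placeholder (the
real queue name is not given here), and pbs_version is left out on the
assumption that it is reported by the server rather than set by hand.

```python
# Sketch only: build qmgr command strings from the settings shown above.
# "batch" is a hypothetical queue name, not taken from the actual cluster.
server_attrs = {
    "mail_from": "adm",
    "scheduler_iteration": "60",
    "node_check_rate": "150",
    "tcp_timeout": "6",
}
queue_attrs = {
    "resources_default.nodes": "1",
    "resources_default.walltime": "01:00:00",
    "enabled": "True",
    "started": "True",
}


def qmgr_commands(server_attrs, queue_name, queue_attrs):
    """Return one 'set ...' directive per attribute, server first."""
    cmds = [f"set server {key} = {value}" for key, value in server_attrs.items()]
    cmds += [
        f"set queue {queue_name} {key} = {value}"
        for key, value in queue_attrs.items()
    ]
    return cmds


commands = qmgr_commands(server_attrs, "batch", queue_attrs)
for cmd in commands:
    # Each directive would be passed to qmgr with its -c option.
    print(f'qmgr -c "{cmd}"')
```

The script only prints the commands; nothing is sent to the server.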

The jobs are almost all sent to the nodes at the same time (around 10 per
node). Everything seems OK at first, but after some time, on certain nodes, the
jobs switch to the exiting state but never exit, and the node status becomes
"down,job-exclusive". These nodes no longer accept new jobs, and overall
throughput slows down dramatically until it simply stops (all jobs in the
queued state).

I would prefer to have one job on each node (or two, since these are
dual-processor machines), but if I set max_running=30 (the approximate number
of CPUs), all the jobs are sent to the first nodes and the problem is the same.
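
To put rough numbers on the workload, here is a back-of-the-envelope sketch. It
assumes exactly 30 CPUs, 2 CPUs per node, every job taking exactly 5 minutes,
and no scheduling overhead; these are simplifications of the approximate
figures above, not measurements.

```python
import math

# Approximate figures from the description above (assumed exact here).
total_jobs = 1000
minutes_per_job = 5
cpus = 30
cpus_per_node = 2
jobs_per_node_observed = 10

# Number of nodes implied by 30 CPUs on dual-processor machines.
nodes = cpus // cpus_per_node

# With one job per CPU, jobs run in "waves" of 30 at a time.
waves = math.ceil(total_jobs / cpus)
ideal_minutes = waves * minutes_per_job

# How many jobs share each CPU when 10 jobs land on one node.
jobs_per_cpu_observed = jobs_per_node_observed / cpus_per_node

print(f"nodes: {nodes}")
print(f"waves of jobs at one job per CPU: {waves}")
print(f"ideal total run time: {ideal_minutes} min")
print(f"observed jobs per CPU: {jobs_per_cpu_observed}")
```

Under these assumptions the whole batch would take a little under three hours
at one job per CPU, while the observed placement puts about five jobs on each
CPU.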

Could you help me? Does anybody have experience with clusters dealing with a
large number of jobs?

Thanks in advance!

Michel

