[torqueusers] scheduling problem; queue stalls

Wesley T. Perdue wes at greenfieldnetworks.com
Mon Sep 13 18:26:11 MDT 2004


Torque users,

I was happy to recently upgrade our cluster from OpenPBS 2.3.16 to Torque 
1.0.1p6-snap.1083163047, as we added some Opterons running RHEL 3 (which 
wouldn't run 2.3.16) to our mix of dual Xeons running RH 7.3.  pbs_server 
and pbs_sched run on a Sun box, which is not a compute node.

We've currently got eight execution nodes with two processors per node, for 
a total of 16 job slots.  We run only single-process jobs.  I've got 
per-queue and per-user limits set such that no single queue or user can 
monopolize the cluster.

We've encountered a problem (bug?) in Torque that I used to see in PBS.  The 
problem starts when a user quickly submits a large number of jobs (say, 30 
or more), probably via a script.  The number of jobs allowed by the queue 
and user limits begins to run, based on available slots.  After that, 
subsequently submitted jobs are not run, even if there are available job 
slots; they remain queued.  If nothing is done, only the jobs submitted in 
the big bundle are run, until they are all gone.  Then scheduling returns 
to normal.

I've found one workaround.  If the queued jobs from the bundle are held 
(using qhold), subsequent jobs are allowed to run.  If they are released 
(using qrls), the queue is once again stalled.  If they are deleted, job 
scheduling returns to normal.
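
For reference, the hold/release workaround above could be scripted roughly 
as follows.  This is only a sketch: it assumes qselect, qhold and qrls are 
on the PATH, that it is run by someone with permission to hold the jobs in 
question, and that "qselect -u USER -s Q" lists that user's queued jobs one 
per line.  The function names are made up for illustration.

```python
import subprocess


def parse_job_ids(qselect_output):
    """Split qselect output into a list of job IDs, skipping blank lines."""
    return [line.strip() for line in qselect_output.splitlines() if line.strip()]


def hold_command(job_ids):
    """Build the qhold command line for the given job IDs."""
    return ["qhold", *job_ids]


def release_command(job_ids):
    """Build the qrls command line for the given job IDs."""
    return ["qrls", *job_ids]


def queued_job_ids(user):
    """Ask qselect for the user's jobs that are still in state Q."""
    result = subprocess.run(
        ["qselect", "-u", user, "-s", "Q"],
        check=True, capture_output=True, text=True,
    )
    return parse_job_ids(result.stdout)


def hold_queued_jobs(user):
    """Hold every queued job of one user; return the IDs that were held."""
    job_ids = queued_job_ids(user)
    if job_ids:
        subprocess.run(hold_command(job_ids), check=True)
    return job_ids
```

Calling hold_queued_jobs("someuser") returns the held job IDs, which can 
later be passed to release_command() and run to release them again.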

In summary, if a bunch of jobs is quickly queued up by a single user, and 
limits are in place to keep that user from monopolizing the cluster, other 
jobs which should run (based on limits and available slots) remain queued 
until all of the bunch are running (i.e. no longer in state Q).
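
The symptom can be illustrated with a toy model.  This is not Torque's 
scheduler code and is not a diagnosis; it only contrasts a scheduling pass 
that stops at the first job blocked by a per-user limit with one that skips 
over it.  The per-user limit of 8 and the user names are hypothetical.

```python
from collections import Counter


def schedule(queue, free_slots, user_limit, skip_blocked):
    """Return the job IDs started in one pass over the queue.

    queue is a list of (job_id, user) pairs in submission order.
    With skip_blocked=False the pass stops at the first job that the
    per-user limit forbids, which mimics the stall described above.
    """
    running = Counter()
    started = []
    for job_id, user in queue:
        if free_slots == 0:
            break
        if running[user] >= user_limit:
            if skip_blocked:
                continue  # leave this job queued, consider the next one
            break         # blocked job at the head stalls everything behind it
        running[user] += 1
        free_slots -= 1
        started.append(job_id)
    return started


# One user submits 30 jobs in a burst, then another user submits 2.
queue = [(f"a{i}", "alice") for i in range(30)] + [("b0", "bob"), ("b1", "bob")]

stalled = schedule(queue, free_slots=16, user_limit=8, skip_blocked=False)
expected = schedule(queue, free_slots=16, user_limit=8, skip_blocked=True)

print(len(stalled), "jobs started when the pass stops at a blocked job")
print(len(expected), "jobs started when blocked jobs are skipped")
```

In this model, removing the blocked jobs from consideration lets the later 
jobs start, which is at least consistent with what holding them does here.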

I can't find a clearly defined set of circumstances that reliably creates 
this situation.  I've seen it with jobs from only two users on the cluster, 
as well as with jobs from many users.  The one defining fact is that a 
bunch of jobs from one user is quickly queued up.

Needless to say, this dramatically affects job throughput on our cluster, 
as we have free slots open while jobs are queued up.

Any help or advice would be appreciated.

Regards,
Wes

----------
Wes Perdue
IT Manager, Greenfield Networks


