[torqueusers] scheduling problem; queue stalls
Wesley T. Perdue
wes at greenfieldnetworks.com
Mon Sep 13 18:26:11 MDT 2004
I was happy to recently upgrade our cluster from OpenPBS 2.3.16 to Torque
1.0.1p6-snap.1083163047, as we added some Opterons running RHEL 3 (which
wouldn't run 2.3.16) to our mix of dual Xeons running RH 7.3. pbs_server
and pbs_sched run on a Sun box, which is not a compute node.
We've currently got eight execution nodes with two processors per node, for
a total of 16 job slots. We run only single-process jobs. I've got
per-queue and per-user limits set such that no single queue or user can
monopolize the cluster.
We've encountered a problem (bug?) in Torque that I used to see in PBS. The
problem starts when a user quickly submits a large number of jobs (say, 30
or more), probably via a script. The number of jobs allowed by the queue
and user limits begins to run, based on available slots. After that,
subsequently submitted jobs are not run, even if there are available job
slots; they remain queued. If nothing is done, only the jobs submitted in
the big bundle are run, until they are all gone. Then scheduling returns
to normal.
I've found one workaround. If the queued jobs from the bundle are held
(using qhold), subsequent jobs are allowed to run. If they are released
(using qrls), the queue is once again stalled. If they are deleted, job
scheduling returns to normal.
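For anyone who wants to script the workaround, here is a minimal Python sketch. It assumes qselect, qhold, and qrls are on the PATH and that qselect accepts -u (user) and -s (state) as it does in OpenPBS. The function names are made up for illustration and are not part of Torque.

```python
import subprocess

def select_cmd(user, state="Q"):
    """Build the qselect command that lists a user's jobs in one state."""
    return ["qselect", "-u", user, "-s", state]

def hold_cmd(job_ids):
    """Build the qhold command for the given job IDs."""
    return ["qhold"] + list(job_ids)

def release_cmd(job_ids):
    """Build the qrls command for the given job IDs."""
    return ["qrls"] + list(job_ids)

def hold_queued(user):
    """Hold every queued job of `user` and return the held job IDs."""
    out = subprocess.run(select_cmd(user), capture_output=True,
                         text=True, check=True).stdout
    jobs = out.split()  # qselect prints one job ID per line
    if jobs:
        subprocess.run(hold_cmd(jobs), check=True)
    return jobs

def release(job_ids):
    """Release jobs held earlier by hold_queued()."""
    if job_ids:
        subprocess.run(release_cmd(job_ids), check=True)
```

Keep the list returned by hold_queued() so the same jobs can be passed to release() later. Once held, the jobs are no longer in state Q, so selecting on Q again would not find them.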
In summary, if a bunch of jobs are quickly queued up by a single user, and
limits are in place to keep them from monopolizing the cluster, other jobs
which should run (based on limits and available slots) are held until all
of the bunch are running (i.e. no longer status Q).
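To make the symptom concrete, here is a toy Python model of one possible explanation: a scheduling pass that stops at the first job blocked by a limit, compared with a pass that skips blocked jobs. This is only a sketch of the behavior described above, not Torque's actual scheduler code. The 16 slots match our cluster, but the per-user limit of 4 and the user names are invented for illustration.

```python
from collections import Counter

def schedule(queue, running, slots, user_limit, skip_blocked):
    """Return the job IDs started in one pass over `queue`, oldest first.

    queue:   list of (job_id, user) tuples in submission order
    running: mapping of user -> number of jobs currently running
    """
    running = Counter(running)
    free = slots - sum(running.values())
    started = []
    for job_id, user in queue:
        if free == 0:
            break
        if running[user] >= user_limit:
            if skip_blocked:
                continue  # look past the blocked job
            break         # strict order: nothing behind it is considered
        running[user] += 1
        free -= 1
        started.append(job_id)
    return started

# alice already has 4 jobs running and 26 more queued ahead of bob's 2.
queue = [(f"a{i}", "alice") for i in range(26)] + [("b0", "bob"), ("b1", "bob")]
running = {"alice": 4}

print(schedule(queue, running, 16, 4, skip_blocked=False))  # []
print(schedule(queue, running, 16, 4, skip_blocked=True))   # ['b0', 'b1']
```

In the first call, bob's jobs stay queued with 12 slots free, which matches what we see. In the second call they start immediately, which is what I would expect from the limits.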
I can't find a clearly defined set of circumstances that reliably creates
this situation. I've seen it with jobs from only two users on the cluster,
as well as with jobs from many users. The one defining fact is that a
bunch of jobs from one user are quickly queued up.
Needless to say, this dramatically affects job throughput on our cluster,
as we have free slots open while jobs are queued up.
Any help or advice would be appreciated.
IT Manager, Greenfield Networks