[torqueusers] Problem with max_running = 1 and large job scripts
garrick at clusterresources.com
Tue Oct 10 01:36:27 MDT 2006
On Mon, Oct 09, 2006 at 05:54:08PM -0600, Garrick Staples alleged:
> On Mon, Oct 09, 2006 at 10:01:25AM -0600, Garrick Staples alleged:
> > On Sun, Oct 08, 2006 at 09:40:48AM +0200, Gijsbert Wiesenekker alleged:
> > > I am running into a problem with Torque version 2.1.2 (built from
> > > source) running on a server with Fedora Core 5 kernel 2.6.17-1.2187 x86_64.
> > > pbs_server, pbs_sched and pbs_mom are all running on this server.
> > > I have defined a queue with max_running = 1 and I submit 5 jobs to that
> > > queue. The job scripts are about 300 Kbytes each.
> > > The first job is run and the other 4 are queued, so this works fine. But
> > > when the first job has finished the second is run, after 20 seconds the
> > > third is run and so forth until all four jobs are running, although
> > > max_running = 1.
> > > I found out that this is related to the size of the job scripts by
> > > looking at the contents of the /var/spool/torque/mom_priv directory. It
> > > looks like job scripts are sent to mom at a rate of 2-3 Kbyte per second
> > > (so it takes about 100 seconds to send my job scripts) and jobs are not
> > > seen by the scheduler as running until the complete script has been sent.
> > > I have tried running five jobs with small (20 Kbyte) input files and
> > > then everything works fine (jobs are run one at a time).
> > > Is this a bug or a configuration problem? Any workarounds (increase the
> > > rate at which job scripts are sent, decrease the 20 second scheduling
> > > interval) available?
> > Definitely seems like a race condition with multiple starting jobs, but
> > I'd say that *increasing* the scheduling interval will work around it.
> Turns out that max_running isn't enforced by the server, it is only
> advisory to the scheduler (and is documented as such). I don't know if
> I want to change this definition at this point.
While I would recommend a larger scheduling interval anyways (60 seconds
is typical), I have a better work-around...
Typically the batch scripts are very short, within a few dozen lines.
Large scripts can be stored separately and simply called from the batch
script, so that only a one-line script is sent to the mom:
echo "$PWD/large-script.sh" | qsub -N largejob
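
The same idea can be sketched with an explicit wrapper file instead of a
pipe. This is only a rough illustration, not a tested recipe for any
particular site: the temporary directory, the file names, and the tiny
stand-in for the real large script are all made up for the example, and
the qsub call is left commented out so the sketch runs on a machine
without Torque.

```shell
#!/bin/sh
# Hypothetical scratch location; in practice this must be a path that
# the execution host can also see.
workdir=$(mktemp -d)
large="$workdir/large-script.sh"
wrapper="$workdir/wrapper.sh"

# Stand-in for the real (about 300 Kbyte) script.
printf '#!/bin/sh\necho "large script ran"\n' > "$large"
chmod +x "$large"

# The batch script that actually gets submitted is just a short pointer
# to the large script, so very little has to be copied to the mom.
printf '#!/bin/sh\nexec "%s"\n' "$large" > "$wrapper"
chmod +x "$wrapper"

# Submit only the small wrapper (uncomment on a host with Torque):
# qsub -N largejob "$wrapper"
```

Either form keeps the submitted script down to a line or two, so the time
spent sending it should no longer be long enough for the scheduler to start
extra jobs in the meantime.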