[torqueusers] Problem with max_running = 1 and large job scripts

Garrick Staples garrick at clusterresources.com
Mon Oct 9 17:54:08 MDT 2006


On Mon, Oct 09, 2006 at 10:01:25AM -0600, Garrick Staples alleged:
> On Sun, Oct 08, 2006 at 09:40:48AM +0200, Gijsbert Wiesenekker alleged:
> > I am running into a problem with Torque version 2.1.2 (built from
> > source) running on a server with Fedora Core 5 kernel 2.6.17-1.2187 x86_64.
> > pbs_server, pbs_sched and pbs_mom are all running on this server.
> > I have defined a queue with max_running = 1 and I submit 5 jobs to that
> > queue. The job scripts are about 300 Kbytes each.
> > The first job is run and the other 4 are queued, so this works fine. But
> > when the first job has finished the second is run, after 20 seconds the
> > third is run and so forth until all four jobs are running, although
> > max_running = 1.
> > I found out that this is related to the size of the job scripts by
> > looking at the content of the /var/spool/torque/mom_priv directory. It
> > looks like job scripts are sent to mom at a rate of 2-3 Kbyte per second
> > (so it takes about 100 seconds to send my job scripts) and jobs are not
> > seen by the scheduler as running until the complete script has been sent.
> > I have tried running five jobs with small (20 Kbyte) input files and
> > then everything works fine (jobs are run one at a time).
> > Is this a bug or a configuration problem? Any workarounds (increase the
> > rate at which job scripts are sent, decrease the 20 second scheduling
> > interval) available?
> 
> Definitely seems like a race condition with multiple starting jobs, but
> I'd say that *increasing* the scheduling interval will work around it.

Turns out that max_running isn't enforced by the server; it is only
advisory to the scheduler (and is documented as such).  I don't know if
I want to change this definition at this point.
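
To make the race concrete, here is a toy model of the behaviour described
above. It is only a sketch and not Torque code: the function name, the
parameters, and the rule that a job counts as running only once its script
transfer has completed are assumptions taken from the report, and jobs are
assumed never to finish within the simulated window.

```python
def simulate(queued, max_running, sched_interval, transfer_time, horizon):
    """Return the start times of jobs launched by an advisory scheduler.

    Assumption (from the report, not from Torque source): the scheduler
    only counts a job as running once its script transfer has completed.
    Jobs never finish inside the simulated window.
    """
    starts = []
    t = 0
    while t <= horizon and queued > 0:
        # Jobs started earlier whose script has fully arrived at the mom.
        visible = sum(1 for s in starts if s + transfer_time <= t)
        started_this_cycle = 0
        # The limit is checked only against what the scheduler can see.
        while queued > 0 and visible + started_this_cycle < max_running:
            starts.append(t)
            queued -= 1
            started_this_cycle += 1
        t += sched_interval
    return starts


# ~300 KB script, roughly 100 s to transfer, 20 s scheduling interval.
print(simulate(4, 1, 20, 100, 200))
# ~20 KB script, roughly 7 s to transfer, same interval.
print(simulate(4, 1, 20, 7, 200))
# Workaround: scheduling interval longer than the transfer time.
print(simulate(4, 1, 120, 100, 600))
```

Under these assumptions the first call starts a job every cycle until all
four are running, while the other two start only one, which matches both
the small-script observation and the suggestion to increase the interval.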



