[torqueusers] Problem with max_running = 1 and large job scripts
garrick at clusterresources.com
Mon Oct 9 10:01:25 MDT 2006
On Sun, Oct 08, 2006 at 09:40:48AM +0200, Gijsbert Wiesenekker alleged:
> I am running into a problem with Torque version 2.1.2 (built from
> source) running on a server with Fedora Core 5 kernel 2.6.17-1.2187 x86_64.
> pbs_server, pbs_sched and pbs_mom are all running on this server.
> I have defined a queue with max_running = 1 and I submit 5 jobs to that
> queue. The job scripts are about 300 Kbytes each.
> The first job is run and the other 4 are queued, so this works fine. But
> when the first job has finished, the second is run; after 20 seconds the
> third is run, and so forth until all four jobs are running, although
> max_running = 1.
> I found out that this is related to the size of the job scripts by
> looking at the contents of the /var/spool/torque/mom_priv directory. It
> looks like job scripts are sent to mom at a rate of 2-3 Kbytes per second
> (so it takes about 100 seconds to send my job scripts) and jobs are not
> seen by the scheduler as running until the complete script has been sent.
> I have tried running five jobs with small (20 Kbyte) input files and
> then everything works fine (jobs are run one at a time).
> Is this a bug or a configuration problem? Any workarounds (increase the
> rate at which job scripts are sent, decrease the 20 second scheduling
> interval) available?
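This definitely seems like a race condition between multiple starting jobs,
but I'd say that *increasing* the scheduling interval, not decreasing it,
will work around it: with an interval longer than the ~100 seconds a script
takes to send, the job is already seen as running by the next scheduling
cycle.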
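
As a rough check on the numbers reported above, here is a small Python
sketch. It is a toy model, not anything taken from the TORQUE source: it
assumes, as the report describes, that a job is not counted as running
until its script has been fully sent, and that each scheduling cycle that
fires in that window may start one more job. The function names are
hypothetical.

```python
import math


def transfer_seconds(script_kbytes, rate_kbytes_per_s):
    """Time needed to send one job script to mom at the observed rate."""
    return script_kbytes / rate_kbytes_per_s


def extra_cycles_during_transfer(script_kbytes, rate_kbytes_per_s, interval_s):
    """Count scheduling cycles that fire strictly before the transfer ends.

    Assumption (toy model): cycles fire at interval_s, 2*interval_s, ...
    after the job is started, and each one may start another job because
    the first is not yet seen as running.
    """
    t = transfer_seconds(script_kbytes, rate_kbytes_per_s)
    return max(0, math.ceil(t / interval_s) - 1)


if __name__ == "__main__":
    # Values from the report: 300 Kbyte and 20 Kbyte scripts,
    # 2-3 Kbytes per second, 20 second scheduling interval.
    for size in (300, 20):
        for rate in (2, 3):
            n = extra_cycles_during_transfer(size, rate, 20)
            print(f"{size} Kbytes at {rate} Kbytes/s: {n} extra cycle(s)")
```

Under these assumptions the 300 Kbyte scripts leave room for several extra
scheduling cycles during each transfer, while the 20 Kbyte ones leave none,
which matches what was observed. By the same arithmetic, an interval longer
than the worst-case transfer time (300 / 2 = 150 seconds) would leave no
extra cycle in the window.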