[torqueusers] Problem with max_running = 1 and large job scripts
gijsbert.wiesenekker at gmail.com
Sun Oct 8 01:40:48 MDT 2006
I am running into a problem with Torque version 2.1.2 (built from
source) running on a server with Fedora Core 5 kernel 2.6.17-1.2187 x86_64.
pbs_server, pbs_sched and pbs_mom are all running on this server.
I have defined a queue with max_running = 1 and I submit 5 jobs to that
queue. The job scripts are about 300 Kbytes each.
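For anyone who wants to reproduce this, a test script of roughly that size could be generated as in the sketch below. The helper name make_padded_script, the comment padding and the "sleep 60" payload are all illustrative assumptions; my real job scripts are not built this way.

```python
def make_padded_script(command: str, target_bytes: int) -> str:
    """Return a shell job script padded with comment lines up to about target_bytes.

    Hypothetical helper for reproducing the problem; not part of Torque.
    """
    header = "#!/bin/sh\n"
    body = command + "\n"
    pad_line = "# " + "x" * 77 + "\n"  # exactly 80 bytes per padding line
    # Number of whole padding lines that fit between header and command.
    n_pad = max(0, (target_bytes - len(header) - len(body)) // len(pad_line))
    return header + pad_line * n_pad + body


# Assumed payload: a job that just sleeps, padded to about 300 Kbytes.
script = make_padded_script("sleep 60", 300 * 1024)
print(len(script), "bytes")
```

The resulting text can be written to a file and submitted five times with qsub -q followed by the name of the queue that has max_running = 1.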
The first job is run and the other 4 are queued, so this works fine. But
when the first job has finished, the second is run; 20 seconds later the
third is run, and so forth until all four remaining jobs are running,
although max_running = 1.
I found out that this is related to the size of the job scripts by
looking at the contents of the /var/spool/torque/mom_priv directory. It
looks like job scripts are sent to mom at a rate of 2-3 Kbyte per second
(so it takes about 100 seconds to send my job scripts) and jobs are not
seen by the scheduler as running until the complete script has been sent.
I have tried running five jobs with small (20 Kbyte) input files and
then everything works fine (jobs are run one at a time).
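The figures above can be checked with a rough calculation. This is only a sketch of my guess at what happens: it assumes a job is invisible to the scheduler until its script has been fully sent, that the scheduler starts one more job on every 20 second cycle during that window, and that transfers do not slow each other down. The function name is made up for illustration.

```python
import math


def extra_jobs_started(script_kb: float, rate_kb_per_s: float, cycle_s: float):
    """Return (transfer time in seconds, scheduling cycles elapsing during it).

    Simplified model: each cycle that passes while the script is still being
    sent is assumed to start one more job, ignoring max_running.
    """
    transfer_s = script_kb / rate_kb_per_s
    return transfer_s, math.floor(transfer_s / cycle_s)


# 300 Kbyte script at the fast end of the observed rate (3 Kbyte/s).
big_time, big_extra = extra_jobs_started(300, 3, 20)
# 20 Kbyte script at the slow end of the observed rate (2 Kbyte/s).
small_time, small_extra = extra_jobs_started(20, 2, 20)

print(big_time, big_extra)
print(small_time, small_extra)
```

Under these assumptions a 300 Kbyte script takes long enough for several scheduling cycles to pass, more than enough to start every queued job, while a 20 Kbyte script is delivered before the next cycle.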
Is this a bug or a configuration problem? Any workarounds (increase the
rate at which job scripts are sent, decrease the 20 second scheduling