[torqueusers] Torque 4.0 and job arrays

Rhys Hill rhys.hill at adelaide.edu.au
Sun Apr 22 21:31:53 MDT 2012


Hi everyone,

I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been
having some trouble with cpusets and I'd hoped that the support for hwloc would
resolve the problem. cpusets are now working very well, but I'm having a lot of
trouble with job arrays, which form a very large part of our workload.

Torque 4.0.0 would regularly lock up when processing job arrays, so I upgraded to
the most recent 4.0.1 snapshot, which behaves much better but still seems
unstable compared to 2.5.9.

One concrete issue is that many of our jobs that worked fine with 2.5.9 are now
stalling with 4.0.1 with the following message:

"Arrays may only be given array dependencies"

which only seems to appear in the server logs and is otherwise invisible. This
restriction was never enforced in 2.5.9, and doesn't really make sense. We frequently
use array->single job dependencies for scatter-gather type operations.
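
For illustration, a scatter-gather submission of this kind might look roughly
like the dry-run sketch below. It only prints the qsub commands and submits
nothing. The job IDs and the script names (scatter.sh, gather.sh) are
hypothetical placeholders, and the reading that the array waits on a single
job is an assumption about which dependency triggers the message.

```shell
#!/bin/sh
# Dry-run sketch: prints qsub commands instead of submitting them.
# PREP_ID, ARRAY_ID and the script names are hypothetical placeholders.
PREP_ID="1000.server"
ARRAY_ID="1001[].server"

scatter_cmd() {
    # Job array of 10 sub-jobs that waits on a single (non-array) job.
    printf 'qsub -t 0-9 -W depend=afterok:%s scatter.sh\n' "$1"
}

gather_cmd() {
    # Single job that waits for every sub-job of the array to succeed.
    # The dependency is quoted because the array ID contains brackets.
    printf "qsub -W 'depend=afterokarray:%s' gather.sh\n" "$1"
}

scatter_cmd "$PREP_ID"
gather_cmd "$ARRAY_ID"
```

In a real workflow the IDs would come from the output of each preceding qsub
call rather than being hard-coded.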

Once the above message has been printed, the job arrays sit in a hold state
forever. They can't be removed using qdel, and if I try to break the hold using
qrls or mjobctl, they move into the queued state but disappear from moab, never
actually start, and still can't be removed. The only way I can get rid of them
is to bring down pbs_server, which has to be killed via `killall -QUIT pbs_server`
since the init script cannot stop the process properly, and then delete the job
files manually.

I'm currently thinking of just reverting to the old, working version of torque,
but has anyone else had trouble with job arrays and can the above problems be
fixed somehow?

Thanks,

--------------------------------------------------------------------------------
Rhys Hill,                                             Senior Research Associate
Australian Centre for Visual Technologies                 University of Adelaide

Phone: +61 8 8313 6197                           Mail:
Fax:   +61 8 8313 4366                           School of Computer Science
                                                 University of Adelaide
                                                 Adelaide, Australia
http://www.cs.adelaide.edu.au/~rhys/             5005
--------------------------------------------------------------------------------


