[torqueusers] Torque 4.0 and job arrays

Ken Nielson knielson at adaptivecomputing.com
Mon Apr 23 11:17:34 MDT 2012


On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:

> Hi everyone,
>
> I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been
> having some trouble with cpusets and I'd hoped that the support for hwloc would
> resolve the problem. cpusets are now working very well, but I'm having a lot of
> trouble with job arrays, which form a very large part of our workload.
>
> Torque 4.0.0 would regularly lock up when processing job arrays, so I upgraded to
> the most recent 4.0.1 snapshot. That behaves much better, but still seems
> unstable compared to 2.5.9.
>
> One concrete issue is that many of our jobs that worked fine with 2.5.9 are now
> stalling with 4.0.1 with the following message:
>
> "Arrays may only be given array dependencies"
>
> which only seems to appear in the server logs and is otherwise invisible. This
> restriction certainly never applied before, and doesn't really make sense. We
> frequently use array->single job dependencies for scatter-gather type operations.
>
> Once the above message has been printed, the job arrays sit in a hold state forever.
> They can't be removed using qdel, and if I try to break the hold using qrls or
> mjobctl, they move into the queued state, but they disappear from moab, never
> actually start, and still can't be removed. The only way I can get rid of them
> is to bring down pbs_server, which has to be killed via `killall -QUIT pbs_server`
> since the init script cannot stop the process properly, and delete the job
> files manually.
>
> I'm currently thinking of just reverting to the old, working version of torque,
> but has anyone else had trouble with job arrays, and can the above problems be
> fixed somehow?
>
> Thanks,
>
>
> --------------------------------------------------------------------------------
> Rhys Hill, Senior Research Associate
> Australian Centre for Visual Technologies, University of Adelaide
>
>
Rhys,

Thanks for the information. We will look at this in TORQUE 4.0. In the
meantime, you may want to create a ticket in Bugzilla:
http://www.clusterresources.com/bugzilla/

Regards

Ken