[torqueusers] Torque 4.0 and job arrays

David Beer dbeer at adaptivecomputing.com
Mon Apr 23 11:22:27 MDT 2012


Rhys,

I'm surprised that you say you haven't seen this message before, as the
check exists in both places and has been there since 2.5 was released.
There must've been a bug that allowed it before. In this case, please try
the attached patch to see if it resolves your problem for 4.0. This patch
only requires you to rebuild and restart the server (dependencies are
unknown to pbs_moms).

David

On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:

> Hi everyone,
>
> I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been
> having some trouble with cpusets and I'd hoped that the support for hwloc
> would resolve the problem. cpusets are now working very well, but I'm having
> a lot of trouble with job arrays, which form a very large part of our
> workload.
>
> Torque 4.0.0 would regularly lock up when processing job arrays, so I upgraded
> to the most recent 4.0.1 snapshot, and that behaves much better, but still
> seems unstable compared to 2.5.9.
>
> One concrete issue is that many of our jobs that worked fine with 2.5.9 are
> now stalling with 4.0.1 with the following message:
>
> "Arrays may only be given array dependencies"
>
> which only seems to appear in the server logs and is otherwise invisible. This
> was certainly never true before, and doesn't really make sense. We frequently
> use array->single job dependencies for scatter-gather type operations.
>
> Once the above message has been printed, the job arrays sit in a hold state
> forever. They can't be removed using qdel, and if I try to break the hold
> using qrls or mjobctl, they move into the queued state, but they disappear
> from moab and never actually start, and still can't be removed. The only way
> I can get rid of them is to bring down pbs_server, which has to be killed via
> `killall -QUIT pbs_server` since the init script cannot stop the process
> properly, and delete the job files manually.
>
> I'm currently thinking of just reverting to the old, working version of
> torque, but has anyone else had trouble with job arrays and can the above
> problems be fixed somehow?
>
> Thanks,
>
>
> --------------------------------------------------------------------------------
> Rhys Hill, Senior Research Associate
> Australian Centre for Visual Technologies, University of Adelaide
>
> Phone: +61 8 8313 6197
> Fax:   +61 8 8313 4366
> Mail:  School of Computer Science, University of Adelaide,
>        Adelaide, Australia 5005
> http://www.cs.adelaide.edu.au/~rhys/
> --------------------------------------------------------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ArrayDeps.patch
Type: application/octet-stream
Size: 754 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/c55502b2/attachment.obj 
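
For readers following the thread, the scatter-gather pattern Rhys mentions can
be sketched roughly as below. This is only an illustration, not the content of
the attached patch: the script names and job ids are hypothetical, the helper
qsub_command is made up for this example, and the commands are built and
printed rather than submitted. It relies on qsub's -t option for arrays and
the -W depend= attribute with the afterok and afterokarray dependency types.
The thread does not say exactly which of these dependency shapes the 4.0.1
check rejects, so treat the mapping to the error message as an assumption.

```python
import shlex


def qsub_command(script, array_range=None, depend=None):
    """Build (but do not run) a qsub argument list.

    Hypothetical helper for illustration only.
    """
    cmd = ["qsub"]
    if array_range is not None:
        cmd += ["-t", array_range]          # submit as a job array
    if depend is not None:
        cmd += ["-W", "depend=" + depend]   # attach a dependency
    cmd.append(script)
    return cmd


# Hypothetical ids, standing in for whatever qsub would print on submission.
PREP_ID = "1000.server"
ARRAY_ID = "1001[].server"

# 1. A single preparation job.
prep = qsub_command("prep.sh")
# 2. A job array that waits on the single job (a non-array dependency
#    placed on an array).
scatter = qsub_command("work.sh", array_range="0-9",
                       depend="afterok:" + PREP_ID)
# 3. A single gather job that waits on the whole array.
gather = qsub_command("gather.sh", depend="afterokarray:" + ARRAY_ID)

for cmd in (prep, scatter, gather):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check which dependency type each
job carries before anything reaches pbs_server.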