[torqueusers] Torque 4.0 and job arrays

David Beer dbeer at adaptivecomputing.com
Tue Apr 24 18:03:35 MDT 2012


Rhys,

One such place is job_purge in src/server/job_func.c: once all of the
jobs in an array have been purged, the array itself is purged too. If you
search the code for calls to array_delete, you'll see all of the
conditions under which it is called. Most of them are error conditions, but
I figure you might want to check them all.
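
If it helps, here is a rough way to list those call sites from a source
checkout. It is only a sketch: the src/server path comes from the location
mentioned above, the helper name find_callers is made up for illustration,
and the pattern also matches the definition and any prototype of
array_delete, not just real calls.

```python
import os
import re


def find_callers(root, symbol="array_delete"):
    """Return (path, line_number, text) for every C source line under
    root where `symbol` is followed by an opening parenthesis.

    Note: this is a plain text match, so the function definition and
    any prototypes are reported along with the actual calls.
    """
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\s*\(")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Only look at C sources and headers.
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, text in enumerate(handle, start=1):
                    if pattern.search(text):
                        hits.append((path, number, text.strip()))
    return hits


if __name__ == "__main__":
    # Assumed layout: run from the top of the source tree.
    for path, number, text in find_callers("src/server"):
        print(f"{path}:{number}: {text}")
```

Each reported line can then be read in context to see which condition
leads to the array being deleted.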

David

On Tue, Apr 24, 2012 at 5:33 PM, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:

>  Hi David,
>
>  I'm not sure - the user who was having trouble hasn't yet tried again.
> I'll put a note in bugzilla either way when we've tried again - I've been
> more focussed on getting our normal jobs working!
>
>  With the changes I suggested in bugzilla, 4.0.1 is working well for me,
> except that most or all job arrays aren't being cleaned up. It seems like
> there must be some code somewhere that looks for all the jobs in an array
> to have finished, then cleans up the array structures themselves. I've had
> a look, but cannot find where this should happen. Can you tell me where
> that is? If I can fix this issue, then I think 4.0.1 will be back to the
> same level of reliability as 2.5.9 for us (with more reliable cpusets as
> well!)
>
> Cheers, Rhys
>
>  ----
>
>  Senior Research Associate,
> Australian Centre for Visual Technologies
>
> On 25/04/2012, at 1:16 AM, "David Beer" <dbeer at adaptivecomputing.com>
> wrote:
>
>   Rhys,
>
>  Just to confirm - that patch fixed your problem? If so I will see that
> it gets checked in. We will look at these other bugzilla issues that you
> created. Thanks for taking the time to report them and in some cases offer
> solutions. We really appreciate the effort to help make TORQUE better.
>
>  David
>
> On Tue, Apr 24, 2012 at 12:23 AM, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:
>
>> Hi David,
>>
>> Thanks for that. I've just found and fixed some other bugs which I've added to
>> bugzilla. The one issue that remains is odd. It seems that we have a situation
>> where an array is stuck, when all of its component jobs are finished.
>>
>> For instance, qstat -f says this:
>>
>> Job Id: 678[].moby.cs.adelaide.edu.au
>>    Job_Name = YZ_Oxford_group
>>    Job_Owner = yanzhichen at moby.cs.adelaide.edu.au
>>    job_state = Q
>>    queue = batch
>>    server = moby.cs.adelaide.edu.au
>>    Checkpoint = u
>>    ctime = Tue Apr 24 09:26:10 2012
>>    Error_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_voca
>>        bulary/out.e.txt
>>    Hold_Types = n
>>    Join_Path = n
>>    Keep_Files = n
>>    Mail_Points = a
>>    mtime = Tue Apr 24 09:26:10 2012
>>    Output_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_voc
>>        abulary/out.o.txt
>>    Priority = 0
>>    qtime = Tue Apr 24 09:26:10 2012
>>    Rerunable = True
>>    Resource_List.mem = 5gb
>>    Resource_List.nodect = 1
>>    Resource_List.nodes = 1:ppn=1
>>    Resource_List.pmem = 5gb
>>    Resource_List.pvmem = 8gb
>>    Resource_List.walltime = 48:00:00
>>    etime = Tue Apr 24 09:26:10 2012
>>    submit_args = -t 2-11 ./job_dogroup
>>    job_array_request = 2-11
>>    fault_tolerant = False
>>    job_radix = 0
>>    submit_host = moby.cs.adelaide.edu.au
>>    init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary
>>
>> whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have no effect
>> on jobs like these. I think I've submitted a fix for the problem that leads to the
>> job getting into this state, but it would be handy if qdel could remove it.
>>
>> Thanks,
>>
>> On 24/04/2012, at 2:52 AM, David Beer wrote:
>>
>> > Rhys,
>> >
>> > I'm surprised that you say you haven't seen this message before, as the
>> > check exists in both places and has been there since 2.5 was released.
>> > There must've been a bug that allowed it before. In this case, please try
>> > the attached patch to see if it resolves your problem for 4.0. This patch
>> > only requires you to rebuild and restart the server (dependencies are
>> > unknown to pbs_moms).
>> >
>> > David
>> >
>> > On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:
>> > Hi everyone,
>> >
>> > I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been
>> > having some trouble with cpusets and I'd hoped that the support for hwloc would
>> > resolve the problem. cpusets are now working very well, but I'm having a lot of
>> > trouble with job arrays, which form a very large part of our workload.
>> >
>> > Torque 4.0.0 would regularly lock-up when processing job arrays, so I upgraded to
>> > the most recent 4.0.1 snapshot, and that behaves much better, but still seems
>> > unstable compared to 2.5.9.
>> >
>> > One concrete issue is that many of our jobs that worked fine with 2.5.9 are now
>> > stalling with 4.0.1 with the following message:
>> >
>> > "Arrays may only be given array dependencies"
>> >
>> > which only seems to appear in the server logs and is otherwise invisible. This
>> > was certainly never true before, and doesn't really make sense. We frequently
>> > use array->single job dependencies for scatter-gather type operations.
>> >
>> > Once the above message has been printed, the job arrays sit in a hold state forever.
>> > They can't be removed using qdel and if I try to break the hold using qrls or
>> > mjobctl, they move into the queued state, but they disappear from moab and never
>> > actually start, and still can't be removed. The only way I can get rid of them
>> > is to bring down pbs_server, which has to be killed via `killall -QUIT pbs_server`
>> > since the init script cannot stop the process properly, and delete the job
>> > files manually.
>> >
>> > I'm currently thinking of just reverting to the old, working version of torque,
>> > but has anyone else had trouble with job arrays and can the above problems be
>> > fixed somehow?
>> >
>> > Thanks,
>> >
>> >
>> > --------------------------------------------------------------------------------
>> > Rhys Hill, Senior Research Associate
>> > Australian Centre for Visual Technologies, University of Adelaide
>> >
>> > Phone: +61 8 8313 6197
>> > Fax:   +61 8 8313 4366
>> > Mail:  School of Computer Science
>> >        University of Adelaide
>> >        Adelaide, Australia 5005
>> > http://www.cs.adelaide.edu.au/~rhys/
>> > --------------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > torqueusers mailing list
>> > torqueusers at supercluster.org
>> > http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>> >
>> >
>> > --
>> > David Beer | Software Engineer
>> > Adaptive Computing
>> >
>> > <ArrayDeps.patch>
>>
>>
>>
>

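For the stuck-array symptom quoted above (a 678[] parent in qstat -f with
no 678[x] entries in qstat -ft), a rough check is to compare the two
listings. This is only a sketch under assumptions: the helper name
orphaned_arrays is made up, it takes the captured text of the two commands
rather than running them, and it assumes sub-jobs are listed as
"Job Id: 678[2].host" in the same style as the parent entry shown in the
quoted output.

```python
import re

# Matches "Job Id: 678[].host" (array parent, empty index) and
# "Job Id: 678[2].host" (assumed sub-job form, numeric index).
JOB_ID = re.compile(r"^Job Id:\s*(\d+)\[(\d*)\]", re.MULTILINE)


def orphaned_arrays(qstat_f_text, qstat_ft_text):
    """Return array numbers that have a parent entry in the qstat -f
    text but no sub-job entries in the qstat -ft text."""
    parents = {num for num, idx in JOB_ID.findall(qstat_f_text) if idx == ""}
    with_subjobs = {num for num, idx in JOB_ID.findall(qstat_ft_text) if idx != ""}
    return sorted(parents - with_subjobs, key=int)
```

Any array number it returns is a candidate for the leftover-array case
described in the thread and is worth checking by hand before removing
anything.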

-- 
David Beer | Software Engineer
Adaptive Computing