[torquedev] Auto delete job when another finishes
"Mgr. Šimon Tóth"
SimonT at mail.muni.cz
Wed Sep 22 05:51:50 MDT 2010
On 22.9.2010 02:46, Gareth.Williams at csiro.au wrote:
>> -----Original Message-----
>> From: "Mgr. Šimon Tóth" [mailto:SimonT at mail.muni.cz]
>> Sent: Wednesday, 22 September 2010 8:58 AM
>> To: Torque Developers mailing list
>> Cc: Williams, Gareth (CSIRO IM&T, Docklands)
>> Subject: Re: [torquedev] Auto delete job when another finishes
>> On 22.9.2010 00:23, Gareth.Williams at csiro.au wrote:
>>>> -----Original Message-----
>>>> From: torquedev-bounces at supercluster.org [mailto:torquedev-
>>>> bounces at supercluster.org] On Behalf Of "Mgr. Šimon Tóth"
>>>> Sent: Wednesday, 22 September 2010 12:22 AM
>>>> To: Torque Dev. Mailing List
>>>> Subject: [torquedev] Auto delete job when another finishes
>>>> I have been facing a dilemma about how best to approach this problem.
>>>> I have two related jobs, one is an outer job (builds a virtual
>>>> machine), one is an inner job (runs in the virtual machine). I want to
>>>> delete the outer job as soon as the inner is completed. I was thinking
>>>> about a special job dependency for this.
>>>> What do you think?
>>>> Mgr. Šimon Tóth
>>> Interesting thought.
>>> Why can't you just (find and) delete the outer job as the last command
>>> in the inner job?
>>> Also, why have separate jobs at all? Why not just build the VM and
>>> start work in it in the same job?
>>> Off-hand I don't think that adding a feature to allow a special
>>> dependency for this is a good idea.
>> I should probably explain this further. We have a nontrivial piece of
>> code handling the creation, destruction and maintenance of virtual
>> clusters in Torque.
>> Now the problem is that some jobs require custom images, but they are
>> not interested in building clusters or VPN, they just want to run the
>> job inside a specific image.
>> We automatically generate the job that will actually build the machine,
>> but the one run inside the built machine is provided by the user.
> So add to the generated script an invocation of the user script to run the work (or to stage data/jobscript into the VM, run it and return the result). Then the (external) job can shut down the VM and exit sensibly. I don't see why Torque should run the internal job separately.
I have no idea how that could work. For one, the job has only user
privileges and virtual machines can only be manipulated as root. And
neither can be changed.
>> The external job (handling the virtual cluster) cannot end before the
>> internal job is fully finished (this includes epilogue).
>> Having only one job would require huge code changes.
> Perhaps the code changes would not be so big.
I would pretty much have to implement another version of the cluster
management from scratch, that would have different semantics and run in
parallel with the current one.
>> The problem is that
>> the creation of a virtual machine can take a lot of time (fetch the
>> image over NFS, write to disk, boot the machine), there are custom hacks
>> that speed up the confirmation of success on both server and node to
>> prevent server hanging. I can't even think about a solution for
>> interactive jobs.
> For interactive jobs maybe you just have a job that advises the user how to connect to the VM (maybe via email...) and how to cancel the session (or just wait until max walltime is reached) and waits indefinitely for the user to connect and do what they want. This is probably unacceptable in many centres but may be OK for you.
Yes, that would be totally unacceptable :)
Mgr. Šimon Tóth
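
The behaviour asked for in this thread, deleting the outer job as soon as the inner one completes, could be approximated outside the server with a small polling watcher. The following is only a minimal sketch under stated assumptions, not a Torque feature: the function names and job ids are hypothetical, the watcher is assumed to run as a user who is allowed to `qdel` the outer job, and a finished job is assumed to show `job_state = C` in `qstat -f` or to be unknown to the server.

```python
import re
import subprocess
import time


def job_finished(returncode, qstat_output):
    """Decide from a `qstat -f <jobid>` call whether the job is done.

    Assumption: a finished job reports job_state = C, or qstat exits
    non-zero because the server no longer knows the job. A non-zero exit
    can also mean the server is unreachable; a real tool would need to
    tell these cases apart.
    """
    if returncode != 0:
        return True
    match = re.search(r"^\s*job_state\s*=\s*(\S+)", qstat_output, re.MULTILINE)
    return match is not None and match.group(1) == "C"


def delete_outer_when_inner_done(inner_id, outer_id, run=subprocess.run,
                                 poll_seconds=30, sleep=time.sleep):
    """Poll the inner job and delete the outer job once it has finished.

    `run` and `sleep` are injectable so the logic can be exercised
    without a Torque installation.
    """
    while True:
        result = run(["qstat", "-f", inner_id], capture_output=True, text=True)
        if job_finished(result.returncode, result.stdout):
            break
        sleep(poll_seconds)
    # The inner job is done (including its epilogue, if the C state is
    # only set afterwards on this site); remove the outer job.
    run(["qdel", outer_id], capture_output=True, text=True)
```

Polling is the simplest option here, but it does not address the privilege and timing concerns raised above; it only illustrates the dependency semantics under discussion.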