[torqueusers] job dependencies, requeuing and routing
Gareth.Williams at csiro.au
Thu Nov 3 22:22:09 MDT 2011
> -----Original Message-----
> From: Eiríkur Hjartarson [mailto:Eirikur.Hjartarson at decode.is]
> Sent: Wednesday, 2 November 2011 9:30 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] job dependencies, requeuing and routing
> I described similar problems in
> September/013404.html that are still unresolved. We are using torque
> 2.5.8 (and maui 3.3.1).
> Eiríkur Hjartarson E-mail: Eirikur.Hjartarson at decode.is
> Íslensk Erfðagreining Mobile: +3546641898
> Sturlugötu 7
> IS-101 Reykjavík
I think the problems are indeed related, and yours (the first one) should be easier to reproduce than mine.
Your second problem is, I think, a separate issue which does not affect our site, as we don't set the max_user_queuable limit. We have been thinking of adding such a limit (to lower memory use in the scheduler), but this problem is a good reason for us to hold off. Fixing the first problem may be needed so that jobs with prerequisites can be held in the routing queue and still be released properly.
Ken, the job dependencies are being set up fine. The problem is the associated holds are not being released in some cases as the prerequisite jobs complete. You can probably reproduce this using qrun as the scheduler.
- Set up queues qr (route) and qe (exec) with a small max_user_queuable limit (maybe 1).
- Keep complete jobs (for a small time, say 2 minutes).
- Submit to qr job ja, some intervening jobs (jc, jd, je) and jb with a dependency on ja, holding ja until jb is submitted, then release ja
- ja should route to the execution queue and jc, jd, je and jb stay in qr
- qrun ja (on an available node)
- when it's done, jc should route to qe and can be qrun (then jd and je)
- jb should then route to qe but (I anticipate) will not qrun unless this happens within the 2-minute keep_completed window...
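For concreteness, the setup above can be sketched with qmgr and qsub roughly as follows. The queue names come from the steps above; test.sh is a placeholder job script, and exact attribute syntax may differ between torque versions:

```shell
# create the execution queue qe with a small per-user limit
qmgr -c "create queue qe queue_type=execution"
qmgr -c "set queue qe max_user_queuable = 1"
qmgr -c "set queue qe enabled = true"
qmgr -c "set queue qe started = true"

# create the routing queue qr feeding qe
qmgr -c "create queue qr queue_type=route"
qmgr -c "set queue qr route_destinations = qe"
qmgr -c "set queue qr enabled = true"
qmgr -c "set queue qr started = true"

# keep completed jobs for 2 minutes (server-wide attribute, in seconds)
qmgr -c "set server keep_completed = 120"

# submit ja (held so jb can name it), the intervening jobs, then jb
JA=$(qsub -h -q qr test.sh)
qsub -q qr test.sh   # jc
qsub -q qr test.sh   # jd
qsub -q qr test.sh   # je
JB=$(qsub -q qr -W depend=afterok:$JA test.sh)
qrls -h u $JA        # release ja; it should now route to qe
qrun $JA             # act as the scheduler and start ja on an available node
```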
If the max_user_queuable limit is higher and the jobs all go straight to qr, I think the problem will not occur.
Later... Well, that is what I thought would happen. I tested, and the situation is worse than I'd imagined. I set max_user_queuable to 1, and that seems fine until you add holds. If I submit a job with a hold to the routing queue (to make sure it's still there when I submit a dependent job), then by default it does not route (route_held_jobs = false). Then I submit a dependent job, and it does route, so I have one job that can't run sitting in the execution queue. The job that needs to run first is still held in the routing queue. I release it with qrls -h u, but it can't route, presumably because the max_user_queuable limit is already satisfied! This is Eiríkur's second problem.
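The failure sequence I'm describing looks like this (test.sh is a placeholder script; assumes a routing queue qr feeding an execution queue qe that has max_user_queuable = 1):

```shell
J1=$(qsub -h -q qr test.sh)                     # held, so it stays in qr (route_held_jobs = false)
J2=$(qsub -q qr -W depend=afterok:$J1 test.sh)  # routes to qe, using up the per-user slot
qrls -h u $J1                                   # hold released, but J1 cannot route:
                                                # qe already holds one of my queued jobs
qstat -a                                        # J1 stuck in qr, J2 held in qe waiting on J1
```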
If I just submit a single job with a hold to the routing queue and then release it, it routes to the execution queue. (Moab failed to start the job in my specific case, it seems because I have a mapping between classes/queues and nodes and it got confused... It was OK for another queue. Maui might have the same issue.)
If I submit a job with a time dependent hold (qsub -a `date -d 'now + 5 minutes' +'%Y%m%d%H%M'` test.q) it stays in the routing queue until the time limit is met, then gets routed (saw the same moab issue here too).
Trying harder and avoiding moab issues with a simpler queue and no max_user_queuable... If I submit a job with a hold and a dependent job with a hold (qsub -h -W depend...), I can then release the first hold (qrls -h u) and the first job runs. If I release the second job soon enough (while the first job is running or during the keep_completed period), it routes to the execution queue, and when the first job is finished, it starts. However, if I don't release the second job until after the first is finished and the keep_completed period is over, the released job routes to the execution queue but remains held. Releasing the job a second time once it's in the execution queue (qrls -h u) allows it to be started.
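The two release timings can be sketched as follows (test.sh is a placeholder script; qr is the routing queue, with no max_user_queuable limit this time):

```shell
J1=$(qsub -h -q qr test.sh)
J2=$(qsub -h -q qr -W depend=afterok:$J1 test.sh)
qrls -h u $J1      # J1 routes to the execution queue and runs

# case 1: release J2 while J1 is running, or within the keep_completed
# period after it finishes -- J2 routes and starts normally

# case 2: wait until J1 has finished and keep_completed has expired, then:
qrls -h u $J2      # J2 routes to the execution queue but remains held (the bug)
qrls -h u $J2      # releasing it again, now in the execution queue, lets it start
```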
I'm not sure where to go from here. What are the chances of these problems being fixed?
Gareth (who is at a conference next week so worked hard to get testing done today)