[torqueusers] job dependencies, requeuing and routing
Eirikur.Hjartarson at decode.is
Wed Nov 2 04:29:42 MDT 2011
I described similar problems in http://www.clusterresources.com/pipermail/torqueusers/2011-September/013404.html; they are still unresolved. We are using torque 2.5.8 (and maui 3.3.1).
Eiríkur Hjartarson E-mail: Eirikur.Hjartarson at decode.is
Íslensk Erfðagreining Mobile: +3546641898
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Gareth.Williams at csiro.au [Gareth.Williams at csiro.au]
Sent: Tuesday, November 01, 2011 11:17 PM
To: torqueusers at supercluster.org; moabusers at supercluster.org
Subject: [torqueusers] job dependencies, requeuing and routing
We recently had a few events as follows:
- job 'A' was queued followed by job 'B' depending on 'A'
- when the scheduler decided to start job 'A', the pbs_mom failed to start the job within some time limit and 'A' went back to the queue (the node was apparently particularly busy)
- some time later job 'A' ran successfully, but the dependency of job 'B' on 'A' was not deemed to be satisfied and job 'B' was left stranded.
Many other jobs with similar dependencies have been working fine.
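For reference, the submission pattern in the events listed above can be sketched roughly as follows. This is only an illustration under assumptions: the script names a.sh and b.sh are hypothetical, and the dependency type "afterok" is a guess, since the type actually used is not stated here. The helper only builds the qsub command line with the "-W depend=" attribute; nothing is submitted unless submit() is called on a host with the torque client tools.

```python
import subprocess


def dependency_args(script, parent_job_id, dep_type="afterok"):
    """Build the qsub argument list for a job that depends on parent_job_id."""
    # -W depend=<type>:<jobid> sets the job's dependency attribute.
    # "afterok" is an assumption; the report does not name the type used.
    return ["qsub", "-W", f"depend={dep_type}:{parent_job_id}", script]


def submit(args):
    """Run qsub and return the job id it prints on stdout."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Usage on a submit host (a.sh and b.sh are hypothetical job scripts):
#   job_a = submit(["qsub", "a.sh"])
#   job_b = submit(dependency_args("b.sh", job_a))
```

If job 'B' is stranded in this way, "qstat -f" on 'B' should still list the dependency on 'A' in its depend attribute, which may help confirm that the dependency was never released.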
I suspect the problem is related to the jobs being submitted to a routing queue. The routing setup is simple in that it (mostly) just puts jobs in a single execution queue. The requeued job goes back to the routing queue and the dependent job stays in the execution queue. I'm not sure how to reproduce the job start failure - which makes it difficult to diagnose the problem further.
We are running torque 3.0.3-snap.201108261653 and moab client 6.0.2 (revision 3, changeset b88217da5915d0a5ec6480f06677cb36e5fa7305).
Has anyone seen similar/related problems?
Or does anyone know enough about job dependencies to know if the implementation might have a bug when combined with routing and job start failure in this way?
Do you think the problem is in torque or moab?