[torqueusers] job dependencies, requeuing and routing

Gareth.Williams at csiro.au
Tue Nov 1 17:17:16 MDT 2011


Hi All,

We recently had a few events as follows:
- job 'A' was queued followed by job 'B' depending on 'A'
- when the scheduler decided to start job 'A', the pbs_mom failed to start the job within some time limit and 'A' went back to the queue (the node was apparently particularly busy)
- some time later job 'A' ran successfully, but the dependency of job 'B' on 'A' was not deemed to be satisfied and job 'B' was left stranded.
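
To make the submission pattern concrete, here is a minimal Python sketch of how such a pair of jobs could be submitted. It is an illustration only: the `afterok` dependency type, the script names `job_a.sh` and `job_b.sh`, and the helper names are assumptions, not the actual setup described here.

```python
import subprocess


def dependent_qsub_args(script, parent_job_id, dep_type="afterok"):
    """Build the qsub argv that makes `script` depend on `parent_job_id`.

    The dependency type defaults to afterok, which is an assumption.
    """
    return ["qsub", "-W", f"depend={dep_type}:{parent_job_id}", script]


def submit(args):
    """Run qsub and return the job id it prints on stdout."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(script_a="job_a.sh", script_b="job_b.sh"):
    """Submit job 'A', then job 'B' depending on 'A' (hypothetical scripts)."""
    job_a = submit(["qsub", script_a])
    job_b = submit(dependent_qsub_args(script_b, job_a))
    return job_a, job_b
```

Calling `submit_chain()` on a submit host would queue both jobs; `dependent_qsub_args` can be inspected on its own without a running server.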

Many other jobs with similar dependencies have been working fine.

I suspect the problem is related to the jobs being submitted to a routing queue. The routing setup is simple in that it (mostly) just puts jobs in a single execution queue. The requeued job goes back to the routing queue while the dependent job stays in the execution queue. I'm not sure how to reproduce the job start failure, which makes it difficult to diagnose the problem further.

We are running 3.0.3-snap.201108261653 and moab client 6.0.2 (revision 3, changeset b88217da5915d0a5ec6480f06677cb36e5fa7305)

Has anyone seen similar/related problems? 

Or does anyone know enough about job dependencies to know if the implementation might have a bug when combined with routing and job start failure in this way? 

Do you think the problem is in TORQUE or Moab?

Regards,

Gareth

