[torqueusers] Automatically resubmitting a failed job
Joshua Bernstein
jbernstein at penguincomputing.com
Tue Dec 8 14:22:43 MST 2009
Prakash,
There is a rerunnable option you can set when you submit a job, but from what
I've seen it's been broken in recent versions of TORQUE. YMMV, though, so you
might want to try enabling it yourself. There isn't a way of setting a limit
like "3" on it, but you might want to have a look at this thread:
http://www.clusterresources.com/pipermail/torqueusers/2006-August/004107.html
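
Since the rerunnable option has no attempt limit, one workaround is to do the retrying inside the job itself instead of resubmitting it. What follows is only a rough sketch of that idea, not a TORQUE feature: the function name run_with_retries, its parameters, and the file name retry.py are all made up for illustration, and it assumes the application signals failure with a non-zero exit status.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0):
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Hypothetical helper, not part of TORQUE. Returns the exit status of
    the last attempt so the caller can pass it on to the batch system.
    """
    status = 1
    for attempt in range(1, max_attempts + 1):
        # subprocess.call returns the exit status of the command
        status = subprocess.call(cmd)
        if status == 0:
            break
        print(
            f"attempt {attempt}/{max_attempts} exited with status {status}",
            file=sys.stderr,
        )
        # Optional pause before the next attempt
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    return status


# Command-line use: python retry.py <command> [args...]
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_retries(sys.argv[1:]))
```

In the job script you would then call something like "python retry.py ./my_app input.dat" in place of running ./my_app directly (both names are placeholders). The job's exit status is that of the last attempt, so it still shows up as non-zero if all three attempts fail. Note that this only reruns the application within the same job on the same node(s); it does not help if the node itself is the problem.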
-Joshua Bernstein
Senior Software Engineer
Penguin Computing
Prakash Velayutham wrote:
> Hello,
>
> Some of my jobs seem to fail for unknown reasons (application bug
> would be my guess). I can see that the exit_status is non-zero from
> Torque's perspective. I would like to retry these jobs (a max of 3
> attempts) automatically. Is there an easy way to do that? Or do I have
> to create a dummy job that spins idly checking for the real job to
> finish, check its output status and then resubmit if it fails??
>
> Thanks,
> Prakash