[torqueusers] Automatically resubmitting a failed job

Joshua Bernstein jbernstein at penguincomputing.com
Tue Dec 8 14:22:43 MST 2009


There is a rerunnable option you can set when you submit a job (qsub -r y),
but from what I've seen it's been broken in recent versions of TORQUE. YMMV,
though, so you might want to try enabling it yourself. There isn't a way of
setting a limit
like "3" on it, but you might want to have a look at this thread
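
If the goal is simply "try up to 3 times", one workaround is to do the
retrying inside the job itself rather than resubmitting through TORQUE. The
sketch below is only an illustration of that idea, not something TORQUE
provides: the script name retry.py, the function run_with_retries, and the
MAX_ATTEMPTS constant are all made up for this example. It reruns a command
until it exits 0 or the attempt limit is reached, then exits with the last
status so TORQUE still sees a non-zero exit_status if every attempt failed.

```python
#!/usr/bin/env python3
"""Retry a command inside a single job, up to a fixed number of attempts.

Illustrative sketch only: names here are hypothetical, not part of TORQUE.
"""
import subprocess
import sys
import time

MAX_ATTEMPTS = 3  # the limit asked about in the original question


def run_with_retries(cmd, max_attempts=MAX_ATTEMPTS, delay=0.0):
    """Run cmd (a list of arguments) until it exits 0 or attempts run out.

    Returns a tuple (exit_status, attempts_used).
    """
    status = 1
    for attempt in range(1, max_attempts + 1):
        status = subprocess.call(cmd)
        if status == 0:
            return status, attempt
        print(
            f"attempt {attempt}/{max_attempts} failed with exit status {status}",
            file=sys.stderr,
        )
        # Optional pause before the next try; skipped after the last attempt.
        if attempt < max_attempts and delay > 0:
            time.sleep(delay)
    return status, max_attempts


# Only act as a command-line wrapper when a command was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    exit_status, _ = run_with_retries(sys.argv[1:])
    # subprocess.call returns a negative number if a signal killed the child;
    # map that to the usual shell convention of 128 + signal number.
    sys.exit(exit_status if exit_status >= 0 else 128 - exit_status)
```

In a job script this would be called as something like
"python3 retry.py ./my_app arg1 arg2" (my_app being a placeholder for the
real application). Note that all attempts run on the same nodes and count
against the same walltime request, so this is not the same thing as a true
resubmission.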


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Prakash Velayutham wrote:
> Hello,
> Some of my jobs seem to fail for unknown reasons (an application bug
> would be my guess). I can see that the exit_status is non-zero from
> Torque's perspective. I would like to retry these jobs (a max of 3
> attempts) automatically. Is there an easy way to do that? Or do I have
> to create a dummy job that spins idly waiting for the real job to
> finish, checks its exit status and then resubmits it if it failed?
> Thanks,
> Prakash
