[torqueusers] automatic job restart

Sam Rash srash at yahoo-inc.com
Mon Dec 11 17:30:19 MST 2006


I believe this was discussed a while back:  if a job exits with a non-zero
status (i.e. fails), TORQUE never re-executes it, right?

 

If this is the case, would it be possible to implement some interface (via
the exit code, a signal, or stdout/stderr in some way) through which a
program can indicate "I failed, but due to something likely local to this
machine or temporary; try me again so that my job ID doesn't change, etc."?

 

A job could specify at submit time which exit code indicates re-run.  If you
wanted to get more tricky, the job could even return more helpful hints such
as re-run anywhere (not abundantly clear that this is useful now), re-run
on another host, or perhaps others.  To avoid confusion with existing jobs
(i.e. those people have today), the job MUST specify that it wants to be
treated this way, so that a return code of, say, 10 doesn't cause restarts
when it shouldn't.
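
To make the opt-in idea concrete, here is a rough sketch in Python.  None of
this is an existing TORQUE feature: `Action`, `decide`, and the `rerun_codes`
mapping are hypothetical names that only illustrate the proposed semantics.
The key property is that a job which declares nothing behaves exactly as it
does today.

```python
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    COMPLETE = "complete"                  # exit code 0: job succeeded
    FAIL = "fail"                          # today's behavior for any non-zero exit
    RERUN_ANYWHERE = "rerun_anywhere"      # hypothetical hint
    RERUN_OTHER_HOST = "rerun_other_host"  # hypothetical hint


def decide(exit_code: int,
           rerun_codes: Optional[Dict[int, Action]] = None) -> Action:
    """Map an exit code to an action.

    `rerun_codes` stands in for whatever the job declared at submit time
    (an assumed interface). Without it, a non-zero exit is a plain failure.
    """
    if exit_code == 0:
        return Action.COMPLETE
    if not rerun_codes:
        return Action.FAIL
    # Only codes the job explicitly opted in to can trigger a re-run.
    return rerun_codes.get(exit_code, Action.FAIL)
```

With this shape, `decide(10)` is a plain failure, while
`decide(10, {10: Action.RERUN_OTHER_HOST})` asks for a re-run elsewhere.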

 

This allows for nice wrapper scripts around the real job that can check why
something failed and, where appropriate, request the re-run.
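
Such a wrapper might look roughly like the Python sketch below.  The re-run
exit code (10) and the list of "transient" stderr markers are assumptions
chosen for illustration, and the scheduler side that would act on the code
does not exist today.

```python
import subprocess
import sys

# Assumed: the exit code the job declared at submit time as "please re-run".
RERUN_EXIT_CODE = 10

# Assumed examples of messages that suggest a host-local or temporary problem.
TRANSIENT_MARKERS = (
    "No space left on device",
    "Connection timed out",
    "Stale file handle",
)


def classify(returncode: int, stderr_text: str) -> int:
    """Translate the real job's result into the exit code the wrapper reports."""
    if returncode == 0:
        return 0
    if any(marker in stderr_text for marker in TRANSIENT_MARKERS):
        return RERUN_EXIT_CODE
    if returncode < 0:
        # subprocess reports death by signal N as -N; use the shell's 128+N form.
        return 128 + (-returncode)
    return returncode


def main(argv):
    if not argv:
        print("usage: wrapper.py COMMAND [ARGS...]", file=sys.stderr)
        return 2
    result = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    sys.stderr.write(result.stderr)  # pass the job's stderr through unchanged
    return classify(result.returncode, result.stderr)

# To use this as a script, end the file with: sys.exit(main(sys.argv[1:]))
```

The re-run code should be one the real job never returns on its own;
otherwise an ordinary failure with that code would also look like a re-run
request.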

 

This works better than, say, an afternotok dependency that checks and
resubmits, because I need the job to keep the same ID (so that items
dependent on the original job remain dependent on the re-run).

 

Or is there a simple, stateless external way to do this?  (I.e. I don't want
to have to find every job that depends on a failed job and update each
dependency.)

 

Some notion of limited restartability for a job would be very helpful for
achieving clean fault tolerance.
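
By "limited" I mean something like a restart budget.  A small Python sketch
of that logic, with hypothetical names (`run_with_restart_budget`,
`rerun_code`, `max_restarts`) and a plain callable standing in for one
execution of the job:

```python
from typing import Callable, Tuple


def run_with_restart_budget(attempt: Callable[[], int],
                            rerun_code: int,
                            max_restarts: int) -> Tuple[int, int]:
    """Run `attempt` until it stops asking for a re-run or the budget is spent.

    Returns (last exit code, number of attempts made). The job identity never
    changes here; only the attempt counter does.
    """
    attempts = 0
    while True:
        exit_code = attempt()
        attempts += 1
        restarts_used = attempts - 1
        if exit_code != rerun_code or restarts_used >= max_restarts:
            return exit_code, attempts
```

A job that keeps returning the re-run code stops after `max_restarts`
restarts and is then reported as failed with that code, so a persistently
broken job cannot loop forever.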

 

-sr

 

 


 
