[torqueusers] How to find out why a job failed?

Garrick Staples garrick at clusterresources.com
Tue Jun 27 11:03:18 MDT 2006


On Tue, Jun 27, 2006 at 10:42:35AM -0700, Keenahn Jung alleged:
> Thank you for your quick replies! They were very helpful. However, I
> want the scheduler to be smart enough to act on the different return
> codes. For example, for an exit status of X, reschedule the job
> immediately, for exit status Y, alert an admin etc. Should I put this
> logic in the epilogue script?

That would be up to the scheduler.  I don't think the sample C scheduler
or Maui have this feature, but I assume Moab does.
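
For anyone who wants to act on exit codes outside the scheduler, here is a minimal sketch in Python. It assumes "keep_completed" is enabled (for example with `qmgr -c "set server keep_completed = 300"`) and that `qstat -f <jobid>` prints a line of the form `exit_status = N` for a completed job. The exit codes 75 and 2, the action names, and the function names are all hypothetical placeholders standing in for "X" and "Y" from the question, not anything TORQUE defines.

```python
import re
from typing import Dict, Optional

# Hypothetical policy: the codes and action names are placeholders for the
# "exit status X / exit status Y" cases in the question, not TORQUE values.
POLICY: Dict[int, str] = {75: "requeue", 2: "alert_admin"}


def parse_exit_status(qstat_output: str) -> Optional[int]:
    """Return the exit_status found in `qstat -f` style text, or None.

    Assumes the attribute appears on its own line as `exit_status = N`.
    N may be negative.
    """
    match = re.search(
        r"^\s*exit_status\s*=\s*(-?\d+)\s*$", qstat_output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def decide_action(exit_status: Optional[int],
                  policy: Dict[int, str] = POLICY) -> str:
    """Map an exit status to an action name using the given policy."""
    if exit_status is None:
        return "unknown"   # job still running, or already purged
    if exit_status == 0:
        return "none"      # job succeeded, nothing to do
    return policy.get(exit_status, "log_only")
```

The text passed to parse_exit_status() could come from running `qstat -f` on the job ID once the job has completed. What "requeue" or "alert_admin" actually do is left to the site; this only shows the dispatch step.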


> Thanks, K
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick
> Staples
> Sent: Tuesday, June 27, 2006 7:29 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] How to find out why a job failed?
> 
> On Mon, Jun 26, 2006 at 11:52:50AM -0700, Keenahn Jung alleged:
> > Hello, I want to be able to trace the failure of a job. My idea is to
> > have the script in the job have different return codes. How can I keep
> > track of these return codes after the job fails? I have searched the
> > documentation and previous emails and couldn't find anything. This must
> > be a problem others have solved before. Thank you!
> 
> Enable "keep_completed" at the server or queue level.  When jobs exit,
> you'll be able to read the "exit_status" job attribute.
> 
> Or as Mr. Widyono says, just put the exit code in the job output.
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

