[torqueusers] Force a job to rerun after mom has crashed

Ken Nielson knielson at adaptivecomputing.com
Thu Aug 25 11:52:52 MDT 2011



----- Original Message -----
> From: "David Sheen" <sheen at usc.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Thursday, August 25, 2011 11:16:52 AM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> 
> The policy here (NIST) is that nodes that crash are taken offline and
> their jobs qdel -p'd.  The administrators like to get to the root of
> any problems before they bring machines back online, which might take
> days.
> 
> I guess that the answer to my question is, it is not possible to do
> what I want to do.
> 

One other scenario: if a MOM goes down and you restart it with -p and then offline the node, the MOM will track the job until it finishes.

You could also restart the MOM with -q, and the job will be requeued. But because the MOM has lost control of the job, it will continue to run. TORQUE will just stop tracking it.
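
As a rough sketch of the two restart options described above, the command sequences might look like the following. The script only prints the commands instead of running them. The node name node01 and the helper recovery_plan are hypothetical, and the exact behavior of the pbs_mom flags should be checked against the man page for your TORQUE version.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands, does not execute them.
# NODE is a hypothetical host name; set it to the crashed node.
NODE="${NODE:-node01}"

recovery_plan() {
    # $1 selects the scenario: "preserve" or "requeue"
    case "$1" in
        preserve)
            # On the compute node: restart the MOM so it tracks the running job
            echo "pbs_mom -p"
            # On the server: mark the node offline so no new jobs land on it
            echo "pbsnodes -o $NODE"
            ;;
        requeue)
            # On the compute node: restart the MOM and let the job be requeued;
            # the old processes keep running but are no longer tracked
            echo "pbs_mom -q"
            ;;
        *)
            echo "usage: recovery_plan preserve|requeue" >&2
            return 1
            ;;
    esac
}

recovery_plan preserve
recovery_plan requeue
```

With -q, any processes left over from the original run would presumably have to be cleaned up by hand on the node, since TORQUE no longer knows about them.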

Ken

