[torqueusers] Force a job to rerun after mom has crashed
knielson at adaptivecomputing.com
Wed Aug 24 10:06:11 MDT 2011
----- Original Message -----
> From: "Mgr. Šimon Tóth" <toth at fi.muni.cz>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, August 24, 2011 9:27:45 AM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> > Is there any straightforward way to force a job to rerun on a
> > different node after its MOM has crashed?
> This is a PBS Pro feature not supported in Torque.
> But in Torque, when a node crashes, it doesn't really mean anything.
> Once the pbs_mom process is restarted, it will detect the jobs and
> reattach them.
> Mgr. Simon Toth
What Simon says is correct, but there are other options. See the man page for pbs_mom, in particular the -p, -P, -q and -r options. What the mom can do on a restart depends on why the mom went down and for how long. For instance, if the mom crashes and restarts immediately, -p (the default in 2.4 and later) is probably what you want. But if the failure was caused by a system crash, you may want to start the mom with the -q option, which will requeue the jobs so they can be rerun.
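
As a rough sketch of the two cases above, something like the following could go in a site init script. The function name mom_restart_flag and the cause labels "mom-crash" and "system-crash" are made up for illustration; only the -p and -q flags come from the discussion above. Check the pbs_mom man page for your version before relying on it.

```shell
# Hypothetical helper: map the cause of the outage to a pbs_mom restart flag.
# The cause labels are assumptions for this sketch, not anything Torque defines.
mom_restart_flag() {
    case "$1" in
        mom-crash)    echo "-p" ;;  # mom died and came straight back: keep running jobs
        system-crash) echo "-q" ;;  # node went down: requeue jobs so they can be rerun
        *)            echo "unknown cause: $1" >&2; return 1 ;;
    esac
}

# Example use (not run here):
#   pbs_mom "$(mom_restart_flag system-crash)"
```

How you decide which case applies (for example, by comparing the boot time against the last mom log entry) is site specific and is left out here.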