[torqueusers] Force a job to rerun after mom has crashed

David Sheen sheen at usc.edu
Wed Aug 24 12:03:19 MDT 2011


Thanks for the helpful suggestions.  What I'm looking for, actually,
is a way to get the job to run on a different node.  The node in
question has been taken offline and will be unavailable until further
notice.

Obviously, I could just resubmit the job (which I did).  However, I
use bash scripts to automatically generate large numbers of qsub
commands, and then pass variables to the script to tell it what job it
is.  Something like:

qsub script.sh -N $series$number -v \
    name=$series$number,np=$np,other=$more_variables_in_call_to_qsub

If one job in the list (there are 200 or so) fails because the node
crashes, it's irritating to rebuild this variable list by hand.  If
ten jobs crash, it's worth my time to resubmit all 200 rather than try
to redo those ten.  It would be much easier to tell torque to run the
job, with all of its variables saved, somewhere else!
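
One possible workaround, sketched here under assumptions rather than as a Torque feature: have the generating script record each job's name and variable list as it submits, so that a single job can be resubmitted later without rebuilding the list by hand. The names SUBMIT_LOG, QSUB, submit_logged and resubmit are hypothetical helpers, not part of Torque. Only the -N and -v options from the command above are used, and the options are placed before the script name, as in the qsub synopsis.

```shell
#!/bin/bash
# Sketch: remember every submission so one job can be resubmitted later.
# SUBMIT_LOG and QSUB are hypothetical settings, not Torque variables.
SUBMIT_LOG="${SUBMIT_LOG:-submitted_jobs.log}"
QSUB="${QSUB:-qsub}"

# submit_logged NAME VARLIST [SCRIPT]
# Appends one tab-separated record (name, variable list, script) to the
# log, then submits the job. Fields must not contain tabs.
submit_logged() {
  local name="$1" vars="$2" script="${3:-script.sh}"
  printf '%s\t%s\t%s\n' "$name" "$vars" "$script" >> "$SUBMIT_LOG"
  "$QSUB" -N "$name" -v "$vars" "$script"
}

# resubmit NAME
# Looks up the first saved record for NAME and submits it again with the
# same variable list. Returns 1 if no record exists.
resubmit() {
  local name="$1" rec_name vars script
  while IFS=$'\t' read -r rec_name vars script; do
    [ "$rec_name" = "$name" ] && break
  done < "$SUBMIT_LOG"
  if [ "$rec_name" != "$name" ]; then
    echo "no record for job $name" >&2
    return 1
  fi
  "$QSUB" -N "$name" -v "$vars" "$script"
}
```

With this in place, the generating loop would call something like submit_logged "$series$number" "name=$series$number,np=$np" instead of qsub directly, and a job lost to a crashed node could be sent again with resubmit followed by its name. This does not move a job that Torque already holds; it only makes resubmission cheap.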

On Wed, Aug 24, 2011 at 12:06 PM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:
>
>
> ----- Original Message -----
>> From: "\"Mgr. Šimon Tóth\"" <toth at fi.muni.cz>
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Sent: Wednesday, August 24, 2011 9:27:45 AM
>> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
>> > Is there any straightforward way to force a job to rerun on a
>> > different node after its MOM has crashed?
>>
>> This is a PBS Pro feature not supported in Torque.
>>
>> But in Torque, when a node crashes, it doesn't really mean anything.
>> Once the pbs_mom process is restarted, it will detect the jobs and
>> reattach them.
>>
>> --
>> Mgr. Simon Toth
>> _______________________________________________
>
> What Simon says is correct, but there are other options. Read the man page for pbs_mom, in particular the -p, -P, -q and -r options. What the mom can do on a restart depends on why it went down and for how long. For instance, if the mom crashes and restarts immediately, -p (the default in 2.4 and later) is probably what you want. But if the failure was caused by a system crash, you may want to start the mom with the -q option, which will requeue jobs so they can be rerun.
>
> Regards
>
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
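
The distinction Ken draws between the two failure cases could be captured in a small wrapper around the mom restart. This is only a sketch: the function name mom_restart_flag and the cause labels mom_crash and system_crash are made up for illustration, and the mapping covers only the two cases described above. The -P and -r options are left to the pbs_mom man page.

```shell
#!/bin/bash
# Sketch: choose a pbs_mom restart option from the reason the mom went down.
# The cause labels are hypothetical; only -p and -q come from the message above.
mom_restart_flag() {
  case "$1" in
    mom_crash)    echo "-p" ;;  # mom restarted immediately: keep running jobs
    system_crash) echo "-q" ;;  # node went down: requeue jobs so they can rerun
    *)
      echo "unknown cause: $1" >&2
      return 1
      ;;
  esac
}

# Example (run on the compute node, as the user that normally starts the mom):
#   pbs_mom "$(mom_restart_flag system_crash)"
```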

