[torqueusers] Force a job to rerun after mom has crashed

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 12:12:20 MDT 2011



----- Original Message -----
> From: "David Sheen" <sheen at usc.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, August 24, 2011 12:03:19 PM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> Thanks for the helpful suggestions. What I'm looking for, actually,
> is a way to get the job to run on a different node. The node in
> question has been taken offline and will be unavailable until further
> notice.
> 
> Obviously, I could just resubmit the job (which I did). However, I
> use bash scripts to automatically generate large numbers of qsub
> commands, and then pass variables to the script to tell it what job it
> is. Something like:
> 
> qsub script.sh -N $series$number -v
> name=$series$number,np=$np,other=$more_variables_in_call_to_qsub
> 
> If one job in the list (there are 200 or so) fails because the node
> crashes, it's irritating to rebuild this variable list by hand. If
> ten jobs crash, it's worth my time to resubmit all 200 rather than try
> to redo those ten. It would be much easier to tell torque to run the
> job, with all of its variables saved, somewhere else!
> 
If you restart the mom (pbs_mom) with the -q option, all jobs that were running at the time of the crash will be requeued and become eligible to run again. They will also keep their names.
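
As a rough sketch of that restart, assuming pbs_mom is on the PATH of the affected node, that no earlier pbs_mom process is still running there, and that you run it with the privileges the daemon normally needs (typically root): the helper name restart_mom_requeue and the RUN variable are made up for illustration, and only the -q flag comes from the advice above. By default the helper just prints the command, so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical helper: restart_mom_requeue and RUN are illustrative names,
# not part of Torque. Only the -q flag is taken from the reply above.
restart_mom_requeue() {
    if [ "${RUN:-0}" = "1" ]; then
        # Actually start the mom; jobs it was running get requeued.
        pbs_mom -q
    else
        # Dry run (default): show the command without executing it.
        echo "pbs_mom -q"
    fi
}

# Dry run; set RUN=1 on the compute node to really start the mom.
restart_mom_requeue
```

Once the mom is back up, the requeued jobs should carry the same names and variables they were submitted with, so the qsub calls should not need to be rebuilt.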

Ken

