[torqueusers] Force a job to rerun after mom has crashed

Lloyd Brown lloyd_brown at byu.edu
Thu Aug 25 10:52:23 MDT 2011


That's kinda what I figured.

I do agree that routinely using "qdel -p" to clear a job from a downed
node is bad practice, though it is sometimes (very, very rarely)
required.  In the 6 or so years I've been working with Torque, I think
I've done it 2 or 3 times total, and we've handled at least several
million jobs during that time.

Thanks for clarifying.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 08/24/2011 05:27 PM, Ken Nielson wrote:
> This may be my misunderstanding. This thread started with the crash of a MOM, so I was still thinking in terms of a crashed or downed MOM node. As for setting a node to offline, any running jobs will continue to run. If the offline state is cleared on the node, the jobs will still continue to run uninterrupted.
> 
> Sorry for any confusion.
> 
> Ken

