[torqueusers] Force a job to rerun after mom has crashed

David Sheen sheen at usc.edu
Thu Aug 25 11:16:52 MDT 2011


The policy here (NIST) is that nodes that crash are taken offline and
their jobs qdel -p'd.  The administrators like to get to the root of
any problems before they bring machines back online, which might take
days.
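
For anyone reading this in the archive, that policy amounts to roughly
the sequence below. This is only a dry-run sketch: the node name and job
id are made-up placeholders, and the run() helper merely prints each
command instead of executing it, so it is safe to try on a machine
without Torque installed.

```shell
#!/bin/sh
# Dry-run sketch of the "offline the node, purge its jobs" policy.
# NODE and JOBID are hypothetical placeholders, not real identifiers.
NODE="node042"
JOBID="12345"

# Print the command instead of executing it.
run() { echo "+ $*"; }

# Mark the crashed node offline so the scheduler stops placing jobs on it.
run pbsnodes -o "$NODE"

# Purge the job from the server without waiting for the (dead) MOM to respond.
run qdel -p "$JOBID"

# Later, once the root cause is understood, clear the offline flag.
run pbsnodes -c "$NODE"
```

Note that a purged job is gone from the server, so it would have to be
resubmitted by hand rather than rerun.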

I guess the answer to my question is that it is not possible to do
what I want to do.

On Thu, Aug 25, 2011 at 12:52 PM, Lloyd Brown <lloyd_brown at byu.edu> wrote:
> That's kinda what I figured.
>
> I do agree that routinely using "qdel -p" to clear a job from a downed
> node is bad practice, though it is sometimes (very, very rarely) required.  In
> the 6 or so years I've been working with Torque, I think I've done it 2
> or 3 times total, and we've literally handled several million jobs at
> least, during that time.
>
> Thanks for clarifying.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 08/24/2011 05:27 PM, Ken Nielson wrote:
>> This may be my misunderstanding. This thread started with the crash of a MOM, so I was still thinking in terms of a crashed or downed MOM node. As for setting a node to offline, any running jobs will continue to run. If the offline state is cleared on the node, the jobs will still continue to run uninterrupted.
>>
>> Sorry for any confusion.
>>
>> Ken