[torqueusers] Force a job to rerun after mom has crashed

Lloyd Brown lloyd_brown at byu.edu
Wed Aug 24 16:36:39 MDT 2011


Ken,

Can you clarify this statement for me?  We frequently take nodes offline
(pbsnodes -o nodename) when a hardware fault is detected outside of
Torque/Moab.  The pbs_mom still responds to queries, but the scheduler
treats the node as unavailable and won't schedule anything new on it.
The node drains, and then we can work on it and bring it back online.
I don't see what the problem would be.
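
For reference, the cycle I'm describing looks roughly like the sketch
below.  It is only an illustration, not our actual tooling: node01 is a
placeholder hostname, the function names and the DRY_RUN switch are made
up for this example, and it assumes pbsnodes is run by a Torque
manager/operator.  The only real commands involved are "pbsnodes -o"
(set the offline flag), plain "pbsnodes" (show node status), and
"pbsnodes -c" (clear the offline flag).

```shell
#!/bin/sh
# Sketch of the offline / drain / online cycle (hypothetical helper names).
# Set DRY_RUN=1 to print each command instead of executing it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Mark the node offline so no new jobs are scheduled on it.
node_offline() { run pbsnodes -o "$1"; }

# Show the node's current state and attributes, to watch it drain.
node_status() { run pbsnodes "$1"; }

# Clear the offline flag once work on the node is finished.
node_online() { run pbsnodes -c "$1"; }
```

Typical use would be node_offline node01, then node_status node01 from
time to time until the node no longer reports any jobs, then
node_online node01 after the repair.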

Unless by "taken offline" you mean that the pbs_mom is no longer
responsive for some reason (node has kernel panicked, network cable
unplugged, inexperienced admin messed up iptables, etc.).  That I
could see.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 08/24/2011 04:03 PM, Ken Nielson wrote:
>> The node has been taken offline by the administrator for testing.
>>
>> David
>
> Not a good practice with MOMs running jobs.

