[torqueusers] Force a job to rerun after mom has crashed
gus at ldeo.columbia.edu
Wed Aug 24 16:52:32 MDT 2011
Lloyd Brown wrote:
> Can you clarify this statement for me? We frequently take nodes offline
> (pbsnodes -o nodename), when some sort of hardware fault occurs,
> detected outside of Torque/Moab. The pbs_mom is still responsive to
> queries, etc., but the scheduler considers it down, and won't schedule
> anything to it, the node drains, etc., and then we can work on the node,
> and bring it back online. I don't see what the problem would be.
> Unless by "taken offline", you are saying that the pbs_mom is not
> responsive anymore for some reason (node has kernel panic'd; network
> cable unplugged; inexperienced admin messes up iptables; etc.) That I
> could see.
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> On 08/24/2011 04:03 PM, Ken Nielson wrote:
>>> The node has been taken offline by the administrator for testing.
>> Not a good practice with MOMs running jobs.
I was also surprised by that statement.
Like Lloyd, I've been blindly trusting "pbsnodes -o" as a handy tool
to service nodes gracefully, after any ongoing jobs finish.
After all, the pbsnodes man page says:
    -o   Add the OFFLINE state. This is different from being
         marked DOWN. OFFLINE prevents new jobs from running on
         the specified nodes. This gives the administrator a
         tool to hold a node out of service without changing
         anything else. The OFFLINE state will never be set or
         cleared automatically by pbs_server; it is purely for
         the manager or operator.
Is there anything wrong with this practice?
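
For concreteness, here is a rough sketch of the drain-then-service routine I have in mind. It is only an illustration, not a recommended procedure: the function names node_is_drained and drain_and_wait are made up for this example, and it assumes that "pbsnodes nodename" prints a line starting with "jobs = ..." only while jobs are still assigned to the node, which is what I see on our cluster but may differ between Torque versions.

```shell
#!/bin/sh
# Hypothetical helper: reads a "pbsnodes <node>" listing on stdin and
# succeeds (exit 0) when there is no non-empty "jobs = ..." attribute line.
# Assumption: the jobs attribute appears on its own line only while jobs
# are assigned; the "status = ...,jobs=..." line is deliberately ignored.
node_is_drained() {
    ! grep -Eq '^[[:space:]]*jobs[[:space:]]*=[[:space:]]*[^[:space:]]'
}

# Hypothetical helper: mark the node OFFLINE, then poll until it is idle.
# Running jobs are left alone; only new jobs are kept off the node.
drain_and_wait() {
    node="$1"
    pbsnodes -o "$node" || return 1
    until pbsnodes "$node" | node_is_drained; do
        sleep 60    # polling interval is arbitrary
    done
    echo "$node is offline and idle"
}

# Usage (commented out so that sourcing this file has no side effects):
#   drain_and_wait node01      # node01 is a placeholder name
#   ... service the node ...
#   pbsnodes -c node01         # clear the OFFLINE state afterwards
```

Nothing here touches pbs_mom or the running jobs; it only sets the OFFLINE flag and waits, which is the behavior the man page text above describes.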