[torqueusers] Force a job to rerun after mom has crashed

Gus Correa gus at ldeo.columbia.edu
Wed Aug 24 16:52:32 MDT 2011


Lloyd Brown wrote:
> Ken,
> 
> Can you clarify this statement for me?  We frequently take nodes offline
> (pbsnodes -o nodename), when some sort of hardware fault occurs,
> detected outside of Torque/Moab.  The pbs_mom is still responsive to
> queries, etc., but the scheduler considers it down, and won't schedule
> anything to it, the node drains, etc., and then we can work on the node,
> and bring it back online.  I don't see what the problem would be.
> 
> Unless by "taken offline", you are saying that the pbs_mom is not
> responsive anymore for some reason (node has kernel panic'd; network
> cable unplugged; inexperienced admin messes up iptables; etc.)  That I
> could see.
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 08/24/2011 04:03 PM, Ken Nielson wrote:
>>> The node has been taken offline by the administrator for testing.
>>>> David
>>>>
>>>>
>> Not a good practice with MOMs running jobs.

Hi All

I was also surprised by that statement.
Like Lloyd, I've been blindly trusting "pbsnodes -o" as a handy tool
to service nodes gracefully, after any ongoing jobs finish.
After all, the pbsnodes man page says:

        -o  Add the OFFLINE state.  This  is  different  from  being
            marked  DOWN.  OFFLINE prevents new jobs from running on
            the specified nodes.  This  gives  the  administrator  a
            tool to hold a node out of service without changing any-
            thing else.  The OFFLINE state  will  never  be  set  or
            cleared  automatically  by  pbs_server; it is purely for
            the manager or operator.
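
To make the practice concrete, here is a rough sketch of the drain-and-service
sequence I have in mind. It is only an illustration, not a recommended
procedure: the node name is whatever you pass on the command line, the 60
second polling interval is arbitrary, and the node_is_idle helper is something
I made up for this example. It assumes that "pbsnodes nodename" prints a
"jobs = ..." line only while jobs are assigned to the node, which may differ
between Torque versions.

```shell
#!/bin/sh
# Sketch of a drain-then-service workflow (assumptions noted in comments).

# Hypothetical helper: reads a `pbsnodes <node>` report on stdin and succeeds
# (exit 0) when no non-empty "jobs = ..." attribute line is present.
# Assumption: a busy node reports its jobs on a line starting with "jobs =".
node_is_idle() {
    ! grep -q '^[[:space:]]*jobs[[:space:]]*=[[:space:]]*[^[:space:]]'
}

# Only talk to pbs_server when a node name is given as the first argument.
if [ -n "${1:-}" ]; then
    NODE=$1
    pbsnodes -o "$NODE"      # add OFFLINE: no new jobs start on the node
    until pbsnodes "$NODE" | node_is_idle; do
        sleep 60             # running jobs are left alone; wait for them
    done
    echo "$NODE has drained; service it, then run: pbsnodes -c $NODE"
fi
```

Once the work is done, "pbsnodes -c nodename" clears the OFFLINE state, and
"pbsnodes -l" lists the nodes that are still marked offline or down.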

Is there anything wrong with this practice?

Many thanks,
Gus Correa

