[torqueusers] Force a job to rerun after mom has crashed

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 17:27:22 MDT 2011


----- Original Message -----
> From: "Gus Correa" <gus at ldeo.columbia.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, August 24, 2011 4:52:32 PM
> Subject: Re: [torqueusers] Force a job to rerun after mom has crashed
> Lloyd Brown wrote:
> > Ken,
> >
> > Can you clarify this statement for me? We frequently take nodes offline
> > (pbsnodes -o nodename), when some sort of hardware fault occurs,
> > detected outside of Torque/Moab. The pbs_mom is still responsive to
> > queries, etc., but the scheduler considers it down, and won't schedule
> > anything to it, the node drains, etc., and then we can work on the node,
> > and bring it back online. I don't see what the problem would be.
> >
> > Unless by "taken offline", you are saying that the pbs_mom is not
> > responsive anymore for some reason (node has kernel panic'd; network
> > cable unplugged; inexperienced admin messes up iptables; etc.) That I
> > could see.
> >
> > Lloyd Brown
> > Systems Administrator
> > Fulton Supercomputing Lab
> > Brigham Young University
> > http://marylou.byu.edu
> >
> > On 08/24/2011 04:03 PM, Ken Nielson wrote:
> >>> The node has been taken offline by the administrator for testing.
> >>>> David
> >>>>
> >>>>
> >> Not a good practice with MOMs running jobs.
> 
> Hi All
> 
> I was also surprised by that statement.
> Like Lloyd, I've been blindly trusting "pbsnodes -o" as a handy tool
> to service nodes gracefully, after any ongoing jobs finish.
> After all, the pbsnodes man page says:
> 
> -o  Add the OFFLINE state. This is different from being
>     marked DOWN. OFFLINE prevents new jobs from running on
>     the specified nodes. This gives the administrator a
>     tool to hold a node out of service without changing
>     anything else. The OFFLINE state will never be set or
>     cleared automatically by pbs_server; it is purely for
>     the manager or operator.
> 
> Is there anything wrong with this practice?
> 
> Many thanks,
> Gus Correa

This may have been my misunderstanding. This thread started with the crash of a MOM, so I was still thinking in terms of a crashed or downed MOM node. When a node is set to offline, any running jobs will continue to run. If the offline state is later cleared on the node, those jobs will still continue to run uninterrupted.
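
As a rough illustration of the drain-and-restore workflow discussed above, here is a minimal shell sketch. It assumes only the pbsnodes options -o (add OFFLINE), -c (clear OFFLINE) and -l (list nodes that are down, offline or unknown). The function names and the PBSNODES variable are hypothetical helpers for this example and are not part of Torque. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: drain a node for maintenance, then return it to service.
# PBSNODES is a hypothetical override; the default is a dry run that
# just echoes the command instead of contacting pbs_server.
PBSNODES="${PBSNODES:-echo pbsnodes}"

drain_node() {
    # -o adds the OFFLINE state: no new jobs are started on the node,
    # while jobs already running there continue to run.
    $PBSNODES -o "$1"
}

restore_node() {
    # -c clears the OFFLINE state so the node can accept jobs again.
    $PBSNODES -c "$1"
}

list_unavailable() {
    # -l lists nodes currently marked down, offline or unknown.
    $PBSNODES -l
}
```

To run against a real server, one would set PBSNODES=pbsnodes before calling, for example, drain_node nodename, wait for the running jobs to finish, service the node, and then call restore_node nodename.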

Sorry for any confusion.

Ken

