[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Douglas Needham dneedham at cmu.edu
Fri Dec 11 09:52:53 MST 2009

On Fri, 2009-12-11 at 14:04 +1100, Chris Samuel wrote:
> I would argue that you should never start pbs_mom on
> boot, ever.
> We only know of one cluster where that is done and it
> causes persistent problems for all sorts of reasons. :(

I would like to hear the details on this.  Would you be willing to
highlight some of the issues at least?  

From personal experience (I was the developer responsible for the 1200+
UNIX nodes at CompuServe years ago, and the one to whom operations came
with complaints, RFEs, etc.), it seems to me that with a cluster having
a sufficient number of nodes, the administrative cost of having to take
steps to start pbs_mom could soon become all-consuming.  I know of one major
cluster which has a scheduled power outage in the coming weeks, and even
having to start just one process per node, even using some script from
an admin node, could mean an hour or more of additional downtime.

- Doug
