[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Glen Beane glen.beane at gmail.com
Fri Dec 11 10:31:46 MST 2009


On Fri, Dec 11, 2009 at 11:52 AM, Douglas Needham <dneedham at cmu.edu> wrote:
> On Fri, 2009-12-11 at 14:04 +1100, Chris Samuel wrote:
>> I would argue that you should never start pbs_mom on
>> boot, ever.
>>
>> We only know of one cluster where that is done and it
>> causes persistent problems for all sorts of reasons. :(
>
> I would like to hear the details on this.  Would you be willing to
> highlight some of the issues at least?
>
> From personal experience (I was the developer responsible for the 1200+
> UNIX nodes at CompuServe years ago, and the one to whom operations came
> with complaints, RFEs, etc.), it seems to me that with a cluster having
> a sufficient number of nodes, the administrative cost of having to take
> steps to start pbs_mom could soon become consuming.  I know of one major
> cluster which has a scheduled power outage in the coming weeks, and even
> having to start just one process per node, even using some script from
> an admin node, could mean an hour or more of additional downtime.


If you have a flaky node that reboots itself due to some hardware
problem, you don't want it to come back up, start pbs_mom, and begin
accepting jobs until the node has been tested and the faulty hardware
replaced.

