[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
jdsmit at sandia.gov
Mon Dec 14 10:12:08 MST 2009
I manage a 4480-node Torque cluster, and for the longest time started
pbs_mom on boot without any significant problems.
In the original deployment there were issues with the ping/hello flood,
but working with Garrick and CRI, we got past that.
Now we have a startup item that runs a few checks (filesystem
availability, some OS checks, etc.) and starts pbs_mom at boot time
only if all tests pass.
It adds a bit of sanity to the start-up process, but does not take admin
I too would be interested in what problems you have seen, as we had few
to none after fixing the timings.
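
The gated startup described above (run a few checks, then start pbs_mom
only if all of them pass) could be sketched roughly as follows. This is
an illustration only, not the actual startup item: the mount points, the
function names, and the assumption that pbs_mom is on the PATH are all
hypothetical, and the real checks are site-specific.

```python
import os
import subprocess
import sys

# Hypothetical values: the post does not name the filesystems or checks used.
REQUIRED_MOUNTS = ["/home", "/scratch"]
PBS_MOM_CMD = ["pbs_mom"]  # assumes pbs_mom is on the PATH


def filesystem_available(path):
    """Return True if path is a mount point and is readable."""
    return os.path.ismount(path) and os.access(path, os.R_OK)


def run_checks(checks):
    """Run each (name, callable) pair; return the names of failed checks.

    A check that raises an exception is treated as failed.
    """
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def start_pbs_mom():
    """Launch pbs_mom and return its exit status."""
    return subprocess.call(PBS_MOM_CMD)


def start_mom_if_healthy(checks, start=start_pbs_mom):
    """Start pbs_mom only when every check passes; return True on success."""
    failed = run_checks(checks)
    if failed:
        print("not starting pbs_mom; failed checks: " + ", ".join(failed),
              file=sys.stderr)
        return False
    return start() == 0


def default_checks():
    """Build one filesystem check per required mount point."""
    return [("mount " + m, lambda m=m: filesystem_available(m))
            for m in REQUIRED_MOUNTS]
```

A boot-time script would call start_mom_if_healthy(default_checks()) and
exit non-zero on failure, so a node with a missing filesystem stays out
of the pool without anyone having to intervene by hand.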
Sandia National Labs
Douglas Needham wrote:
> On Fri, 2009-12-11 at 14:04 +1100, Chris Samuel wrote:
>> I would argue that you should never start pbs_mom on
>> boot, ever.
>> We only know of one cluster where that is done and it
>> causes persistent problems for all sorts of reasons. :(
> I would like to hear the details on this. Would you be willing to
> highlight some of the issues at least?
> From personal experience (I was the developer responsible for the 1200+
> UNIX nodes at CompuServe years ago, and the one to whom operations came
> with complaints, RFEs, etc.), it seems to me that with a cluster having
> a sufficient number of nodes, the administrative cost of having to take
> steps to start pbs_mom could soon become consuming. I know of one major
> cluster which has a scheduled power outage in the coming weeks, and even
> having to start just one process per node, even using some script from
> an admin node, could mean an hour or more of additional downtime.
> - Doug