[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Jerry Smith jdsmit at sandia.gov
Mon Dec 14 10:12:08 MST 2009


I manage a 4480-node TORQUE cluster and, for the longest time, started 
pbs_mom on boot without any significant problems.

In the original deployment there were issues with the ping/hello flood, 
but working with Garrick and CRI, we got past that.

Now we have a startup item that runs a few checks (filesystem 
availability, some OS checks, etc.) and starts pbs_mom at boot time 
only if all of them pass. 
It adds a bit of sanity to the start-up process without requiring any 
admin intervention.
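
A minimal sketch of that kind of check-then-start wrapper, in Python. 
It is an illustration only, not the startup item described above: the 
mount points, the pbs_mom path, and all function names are assumptions 
to be replaced with site-specific values.

```python
import os
import subprocess
import sys

# Hypothetical, site-specific values (not taken from any real cluster).
REQUIRED_MOUNTS = ["/home", "/scratch"]
PBS_MOM_CMD = ["/usr/sbin/pbs_mom"]  # assumed install path


def mount_ok(path):
    """True if path is a mount point that can be read and traversed."""
    return os.path.ismount(path) and os.access(path, os.R_OK | os.X_OK)


def failed_checks(checks):
    """Run each (name, callable) check; return the names that failed."""
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except OSError:
            # Treat an I/O error during a check as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    return failures


def start_if_healthy(checks, start):
    """Call start() only when every check passes; return the failures."""
    failures = failed_checks(checks)
    if not failures:
        start()
    return failures


def main():
    """Entry point a boot script could call; returns a shell exit code."""
    checks = [("mount " + m, lambda m=m: mount_ok(m)) for m in REQUIRED_MOUNTS]
    failures = start_if_healthy(
        checks, lambda: subprocess.check_call(PBS_MOM_CMD)
    )
    if failures:
        print("not starting pbs_mom; failed: " + ", ".join(failures),
              file=sys.stderr)
        return 1
    return 0
```

A boot script would call main() and exit with its return value. Passing 
the start action in as a callable keeps the decision logic testable on a 
machine that has no pbs_mom installed, and further OS checks can be 
appended to the list as extra (name, callable) pairs.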

I too would be interested in what problems you have seen, as we had few 
to none after fixing the timings.

--Jerry
---------------------
Sandia National Labs
Scientific Computing


Douglas Needham wrote:
> On Fri, 2009-12-11 at 14:04 +1100, Chris Samuel wrote:
>   
>> I would argue that you should never start pbs_mom on
>> boot, ever.
>>
>> We only know of one cluster where that is done and it
>> causes persistent problems for all sorts of reasons. :(
>>     
>
> I would like to hear the details on this.  Would you be willing to
> highlight some of the issues at least?  
>
> From personal experience (I was the developer responsible for the 1200+
> UNIX nodes at CompuServe years ago, and the one to whom operations came
> with complaints, RFEs, etc.), it seems to me that with a cluster having
> a sufficient number of nodes, the administrative cost of having to take
> steps to start pbs_mom could soon become consuming.  I know of one major
> cluster which has a scheduled power outage in the coming weeks, and even
> having to start just one process per node, even using some script from
> an admin node, could mean an hour or more of additional downtime.
>
> - Doug
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>   