[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Chris Samuel csamuel at vpac.org
Mon Dec 14 19:41:09 MST 2009


/* Re: starting pbs_mom on boot */

----- "Bogdan Costescu" <bcostescu at gmail.com> wrote:

> Indeed but, by not starting pbs_mom, aren't you using
> the wrong tool for this problem ?

Er, not really.  Why take the time to set up the init
scripts for something we don't want to run on boot ?

> IMHO a node that reboots spontaneously (IOW, not
> directed by the sysadmin) should automatically be
> marked as offline.

Well in our case it's marked as "down", which makes
it more obvious that something unplanned is going on.

> Isn't the offline state supposed to be used for
> exactly this kind of situation ?

We tend to use it much more for scheduled downtime,
either to drain nodes for maintenance/troubleshooting
or if a job has gone crazy.

> If the node is offline, upon coming back up, it
> will not be used for jobs - so no chance of
> draining the queue.

Again, that means you've got to take the time to
both set up scripts to spot a node that's rebooted
due to a problem and also the init scripts for Torque.

Why expend that energy when just not starting it
achieves exactly the outcome that we desire ?

> > Secondly we've seen MPI jobs fail where the default
> > resource limit on the amount of memory that can be
> > locked causes job initialisation to fail.  For some
> > reason even inserting a "ulimit -l unlimited" into
> > the init.d script before it starts the pbs_mom didn't
> > seem to fix it.
> 
> And how does it work differently if you start
> pbs_mom manually ?

It just works.  We can't really say why, as it's their
cluster and we don't have admin access to it; we
just try to help them debug problems.
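
One way to narrow down a locked-memory problem like this is to look at
the limit the running daemon actually has, rather than the one the init
script asked for.  The following is only a minimal sketch, assuming a
Linux node where /proc/<pid>/limits is available; the function names are
made up for illustration and are not part of Torque.

```python
from typing import Optional, Tuple

# Row label used by Linux in /proc/<pid>/limits for RLIMIT_MEMLOCK.
LIMIT_NAME = "Max locked memory"


def parse_memlock(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) for the locked-memory limit, or None if absent."""
    for line in limits_text.splitlines():
        if line.startswith(LIMIT_NAME):
            # The remaining columns are: soft limit, hard limit, units.
            fields = line[len(LIMIT_NAME):].split()
            if len(fields) >= 2:
                return fields[0], fields[1]
    return None


def read_memlock(pid: str = "self") -> Optional[Tuple[str, str]]:
    """Read the locked-memory limit of a process (hypothetical helper)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_memlock(fh.read())


if __name__ == "__main__":
    try:
        # "self" is this Python process; pass the pbs_mom PID instead
        # to see what the daemon itself is running with.
        print(read_memlock())
    except OSError as exc:
        print(f"could not read limits: {exc}")
```

Comparing the values for a pbs_mom started from init.d with one started
by hand would show whether the "ulimit -l unlimited" is really taking
effect for the daemon.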

> When you have a larger number of nodes, do you
> check each and every one of them before starting
> pbs_mom manually ?

Well if pbs_mom isn't running it's obvious something
very bad has happened on the node so we do go and see
if we can figure out what has gone wrong.

In Australia we don't tend to have what folks overseas
call large numbers of nodes; our largest cluster has
111 at the moment.

> Can this checking be automated so that it can be
> part of the init.d script ? Or part of some node
> health monitoring tool that can then trigger a
> pbs_mom start ?

We do have health check scripts, and pbs_mom runs
them every 10 minutes or so, but our concern is
always for the unexpected.
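
For the known case of a node that has just rebooted, a check along the
following lines could be one of those scripts.  This is a rough sketch
only, not what we actually run: the one-hour threshold and the function
name are invented for illustration, and it assumes Linux's /proc/uptime.
As I understand it, pbs_mom picks up a health check result when the
script's output starts with "ERROR", and what happens to the node after
that depends on how the server and scheduler are configured, so check
the Torque documentation for your version.

```python
from typing import Optional

# Hypothetical threshold: treat anything under an hour as "just rebooted".
MIN_UPTIME_SECONDS = 3600.0


def check_uptime(uptime_text: str,
                 min_seconds: float = MIN_UPTIME_SECONDS) -> Optional[str]:
    """Return an ERROR message if the node booted recently, else None.

    uptime_text is the content of /proc/uptime, whose first field is
    the number of seconds since boot.
    """
    uptime = float(uptime_text.split()[0])
    if uptime < min_seconds:
        return f"ERROR node up for only {uptime:.0f}s, possible unplanned reboot"
    return None


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as fh:
            message = check_uptime(fh.read())
    except (OSError, ValueError, IndexError) as exc:
        message = f"ERROR could not read uptime: {exc}"
    # Print only when there is a problem; a healthy node stays silent.
    if message:
        print(message)
```

Note that a check like this clears itself once the threshold passes and
only catches the failure you thought of in advance, which is exactly
the limitation described above.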

It's been a while since we've had a spontaneously
rebooting node, but ever since then we've not
bothered to set up pbs_mom to start on boot; it's
just not been worth the possible pain to us.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

