[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Bogdan Costescu bcostescu at gmail.com
Mon Dec 14 04:23:14 MST 2009


I am also of the opinion that pbs_mom should be started automatically.
Some discussion below:

On Sat, Dec 12, 2009 at 12:37 PM, Chris Samuel <csamuel at vpac.org> wrote:
> Firstly, as Glen mentioned, a node that goes bad and
> reboots under load will drain your queue through the
> reboot->accept job->reboot loop. :-(

If it was unstable enough to fail to finish the current job, why
should I trust it with the next job ? Indeed. But by not starting
pbs_mom, aren't you using the wrong tool for this problem ? IMHO a
node that reboots spontaneously (IOW, not directed by the sysadmin)
should automatically be marked as offline. Isn't the offline state
supposed to be used for exactly this kind of situation ?

If the node is offline, upon coming back up, it will not be used for
jobs - so no chance of draining the queue.
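
To make the idea concrete, here is a minimal sketch of a boot-time hook
that marks the node offline after an unexpected reboot. It assumes a
marker file (the path is hypothetical) that the shutdown script writes
on an orderly halt, and that "pbsnodes -o" can be run from wherever the
hook executes, with operator or manager rights. The function names are
made up for illustration.

```python
import os
import subprocess

# Hypothetical marker written by the shutdown script on an orderly halt/reboot.
CLEAN_MARKER = "/var/lib/node-state/clean_shutdown"

def reboot_was_expected(marker_path=CLEAN_MARKER):
    """True if the marker from the last orderly shutdown is present."""
    return os.path.exists(marker_path)

def offline_command(node):
    """Build the pbsnodes call that marks `node` offline."""
    return ["pbsnodes", "-o", node]

def handle_boot(node, marker_path=CLEAN_MARKER, run=subprocess.call):
    """Offline the node after an unexpected reboot; consume the marker otherwise."""
    if reboot_was_expected(marker_path):
        os.remove(marker_path)  # the next boot must earn a fresh marker
        return None
    cmd = offline_command(node)
    run(cmd)  # assumption: pbsnodes is on PATH and allowed to talk to pbs_server
    return cmd
```

The hook would be called early at boot, e.g. handle_boot("node01"),
with the node name exactly as pbs_server knows it. pbs_mom itself can
then be started unconditionally, since an offline node gets no new jobs.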

> Secondly we've seen MPI jobs fail where the default
> resource limit on the amount of memory that can be
> locked causes job initialisation to fail.  For some
> reason even inserting a "ulimit -l unlimited" into
> the init.d script before it starts the pbs_mom didn't
> seem to fix it.

And how does it work differently if you start pbs_mom manually ?
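
One way to find out would be to compare the limit that a running
pbs_mom actually inherited in each case. A minimal sketch, assuming a
Linux kernel that provides /proc/<pid>/limits; the function names are
made up for illustration:

```python
def memlock_limits(limits_text):
    """Return (soft, hard) for 'Max locked memory' from /proc/<pid>/limits text."""
    label = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # The remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None  # label not found

def read_limits(pid):
    """Read the limits file of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as f:
        return f.read()
```

Running memlock_limits(read_limits(pid)) with the pid of pbs_mom, once
after a start from init.d and once after a manual start, would show
whether the two really end up with different limits.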

> Thirdly, if a node does go bad and reboot then it
> makes diagnosis and troubleshooting a lot easier if
> the node has no jobs on it.

If the node is marked offline upon an unexpected reboot, it would still
remain empty and ready for testing.

When you have a larger number of nodes, do you check each and every one
of them before starting pbs_mom manually ? Can this checking be automated
so that it can be part of the init.d script ? Or part of some node
health monitoring tool that can then trigger a pbs_mom start ?
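
A rough sketch of what such automated gating could look like, assuming
the site supplies its own checks. The single check shown here and all
function names are only examples, not anything that TORQUE provides:

```python
import os

def check_writable(path):
    """A directory that jobs need must exist and be writable."""
    return os.path.isdir(path) and os.access(path, os.W_OK)

def run_checks(checks):
    """Run (name, callable) pairs; return the names of the checks that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed

def should_start_mom(checks):
    """Start pbs_mom only if every check passed."""
    return not run_checks(checks)
```

The init.d script or the monitoring tool would start pbs_mom only when
should_start_mom(...) returns True, and could report the list from
run_checks(...) to the sysadmin otherwise.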

