[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Douglas Wade Needham dneedham at cmu.edu
Mon Dec 14 08:32:47 MST 2009


On Fri, 2009-12-11 at 12:31 -0500, Glen Beane wrote:
> If you have a flaky node that reboots itself due to some hardware
> problem you don't want it to come back up and start pbs_mom and start
> accepting jobs until the node has been tested and the faulty hardware
> replaced

I can certainly see where that would be a concern.  However, given my
experience at places like CompuServe and where I am right now, where we
had 500+ servers, I think the vast majority of hardware faults did not
behave that way.  The ones I have seen either locked the machine up,
left it unable to boot, or, where a reboot did result, had the OS stop
and prompt the admin when an fsck-type check failed.  And even then,
forced downtime for facilities problems (cooling, electrical, etc.)
biased the per-node counts towards those causes, even though the total
number of such incidents was far smaller.  Unfortunately, I do not have
the metrics available to break it down numerically (this is something
we are starting to try to collect formally where I currently work,
breaking it down into things like OS faults, hardware faults, etc.).

Add to this things like SMART tests, and problems like flaky disks are
often found before the host starts rebooting itself and coming back up.
All in all, I would personally rather have pbs_mom start automatically,
though I could see where we might want a tunable in pbs_server so that,
when a reboot (but not a restart of pbs_mom itself) occurs, it keeps
track of the node but does not include it in the pool until cleared.
This way, qmgr could be used from one node, rather than having to log
in to the node(s) in question, to put those nodes back in the pool.
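
To make the idea concrete, here is a rough sketch of the policy in
Python.  None of this is existing TORQUE code: the class names, the
hold_after_reboot tunable, and the assumption that pbs_mom reports the
host's boot time when it checks in are all hypothetical, just enough to
show how a reboot could be told apart from a plain pbs_mom restart.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    boot_time: float       # boot timestamp last reported by the node (assumed)
    held: bool = False     # True: still tracked, but not eligible for jobs


@dataclass
class NodePool:
    hold_after_reboot: bool = True   # the hypothetical server-side tunable
    nodes: dict = field(default_factory=dict)

    def mom_checkin(self, name, boot_time):
        """Record a pbs_mom (re)connect; return True if the node may take jobs."""
        rec = self.nodes.get(name)
        if rec is None:
            # First time this node is seen: nothing to compare against.
            self.nodes[name] = NodeRecord(boot_time)
            return True
        if boot_time > rec.boot_time:
            # Boot time moved forward, so the host itself rebooted.
            rec.boot_time = boot_time
            if self.hold_after_reboot:
                rec.held = True
        # Same boot time means only pbs_mom restarted; state is left alone.
        return not rec.held

    def clear(self, name):
        """Admin action, done from one place, to return a node to the pool."""
        self.nodes[name].held = False

    def schedulable(self):
        """Names of nodes currently eligible for jobs."""
        return sorted(n for n, r in self.nodes.items() if not r.held)


pool = NodePool()
pool.mom_checkin("node01", 1000.0)   # first contact
pool.mom_checkin("node01", 1000.0)   # pbs_mom restart only: stays in the pool
pool.mom_checkin("node01", 2000.0)   # host rebooted: held until cleared
pool.clear("node01")                 # admin puts it back without logging in
```

With the tunable off, the same check-ins would leave the node in the
pool throughout, which matches the "start automatically" behaviour.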

BTW... I would love to hear other folks' input on things like this,
especially if they have metrics.  This is the area we are currently
researching.

- Doug


