Bugzilla – Bug 124
pbs_mom healthcheck scripts should run from a forked process
Last modified: 2011-07-22 00:27:24 MDT
It is quite possible for the healthcheck scripts on some systems to hang.
For instance, on our IBM iDataPlex dx360 M2s the suggested way to check for
memory errors is to interrogate the IPMI event log with ipmitool. Unfortunately
it seems to be quite common for this process to hang, and if it does
the pbs_mom process stops responding and you lose jobs, because communication
between the sisters times out and the job is killed.
Ideally the pbs_mom should run its health check scripts in another process
forked from the original pbs_mom to avoid such problems. If such a
scheduled healthcheck process was still running when it was time to run
another one it could flag the problem with something like:
ERROR healthcheck script hung?
It would need a way to handle clashes with scripts run at job start
or end time (perhaps just skipping the run if the clash is between an
event-driven run and a scheduled one, or between two event-driven ones).
As a workaround we are going to look at different ways of running our
healthcheck scripts to mitigate this, but it'd be really nice for pbs_mom
to do the right thing in the first place!
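To make the request concrete, here is a minimal sketch of the behaviour
described above, written in Python rather than as a patch to pbs_mom. The
HealthCheckRunner class and its method names are hypothetical and exist only
for illustration. The check runs in a child process so the caller never
blocks, and a new run is skipped and flagged if the previous one has not
finished.

```python
import subprocess


class HealthCheckRunner:
    """Run a health check in a child process without blocking the caller.

    Hypothetical helper for illustration; not part of pbs_mom.
    """

    def __init__(self, command):
        self.command = command  # argv list for the check script (assumption)
        self.child = None       # Popen handle of the most recent check

    def trigger(self):
        """Start a check unless the previous one is still running."""
        if self.child is not None and self.child.poll() is None:
            # Previous check has not exited: skip this run and flag it,
            # whether the trigger was scheduled or event driven.
            return "ERROR healthcheck script hung?"
        self.child = subprocess.Popen(
            self.command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return "started"

    def stop(self):
        """Kill a check that is still running, for example at shutdown."""
        if self.child is not None and self.child.poll() is None:
            self.child.kill()
            self.child.wait()
```

Because trigger() only polls the child and never waits on it, a hung
ipmitool call would tie up the child process but not the caller.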
I'm currently working on a brand new framework and implementation for node
health checks here at LBL, so related topics are of particular interest to me.
Seems to me like the simplest solution to this problem is to make sure your
node health check script doesn't hang. There are multiple facets to this
approach: (1) fork only when absolutely necessary, and (2) set an alarm on the
script that will kill it after a relatively brief timeout.
You've already pointed out just a few of the concurrency challenges associated
with trying to background the node health check. Brings to mind the old adage
about an ounce of prevention.... :-)
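As a rough sketch of the timeout idea above, assuming the check is launched
from a Python wrapper: run_with_timeout is a hypothetical name, and the
10 second default is a placeholder rather than a recommended value.

```python
import subprocess


def run_with_timeout(command, timeout_seconds=10):
    """Run a check command, giving up if it exceeds the timeout.

    Returns (ok, message). Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return False, "ERROR check timed out after %s seconds" % timeout_seconds
    if result.returncode != 0:
        return False, "ERROR check exited with status %d" % result.returncode
    return True, result.stdout.strip()
```

One caveat: only the direct child is killed on timeout. If the command is a
shell script that itself starts ipmitool, the ipmitool process may outlive
the script, so calling the tool directly is the safer arrangement.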
We took a third approach to this: we now run our health check scripts from cron
every 15 minutes and they write their output to a local file (in /etc).
The script that the compute nodes run now only outputs the content of those
files (and raises an error if one is missing!).
Works well for us now.
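For anyone wanting to try the same arrangement, the reporting side might look
roughly like the sketch below. The function name and the path passed to it
are hypothetical, and the optional staleness check is an addition that is not
part of the setup described above; it is there only to catch the case where
the cron job itself has stopped running.

```python
import os
import time


def report_cached_status(path, max_age_seconds=None):
    """Return the cached health check output, or an ERROR line.

    Hypothetical helper. max_age_seconds is an illustrative extra and is
    not part of the cron setup described in this thread.
    """
    try:
        age = time.time() - os.path.getmtime(path)
        with open(path) as handle:
            content = handle.read().strip()
    except OSError:
        # Missing or unreadable file: report it rather than stay silent.
        return "ERROR health check status file missing: %s" % path
    if max_age_seconds is not None and age > max_age_seconds:
        return "ERROR health check status file is stale: %s" % path
    return content
```

Since this only reads a small local file, it cannot hang on ipmitool, which is
the point of moving the slow work into cron.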
Agree this should be handled by the healthcheck scripts, not pbs_mom.
Closing as resolved, invalid.