Bugzilla – Bug 124
pbs_mom healthcheck scripts should run from a forked process
Last modified: 2011-07-22 00:27:24 MDT
It is quite possible for the healthcheck scripts on some systems to hang. For instance, on our IBM iDataPlex dx360 M2s the suggested way to check for memory errors is to interrogate the IPMI event log with ipmitool. Unfortunately, it seems to be quite common for this process to hang, and when it does the pbs_mom process stops and you lose jobs, because communication between the sisters times out and the job is killed.

Ideally pbs_mom should run its health check scripts in a separate process forked from the original pbs_mom to avoid such problems. If a scheduled healthcheck process were still running when it was time to run the next one, it could flag the problem with something like:

ERROR healthcheck script hung?

It would need a way to handle clashes with scripts running at job start or end time (perhaps just skipping the run if the clash is between an event-driven run and a scheduled one, or between two event-driven ones).

As a workaround we are going to look at different ways of running our healthcheck scripts to mitigate this, but it'd be really nice for pbs_mom to do the right thing in the first place!
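To make the request concrete, here is a rough Python sketch of the behaviour described above. It is not pbs_mom code (pbs_mom is written in C), and the class and method names are made up for illustration. The check runs in a child process, the caller never blocks on it, and a new run is refused with the suggested error text while the previous one is still alive.

```python
import subprocess
import tempfile


class HealthCheckRunner:
    """Hypothetical helper: runs a check in a child process without blocking."""

    def __init__(self, cmd):
        self.cmd = cmd      # command line as a list, e.g. ["/path/to/check"]
        self.proc = None    # currently running child, if any
        self.out = None     # temp file capturing the child's stdout

    def start(self):
        """Launch the check unless the previous run has not finished yet."""
        if self.proc is not None and self.proc.poll() is None:
            # Previous run still alive: skip this run and flag the problem.
            return "ERROR healthcheck script hung?"
        # A temp file (not a pipe) so a chatty child can never block on write.
        self.out = tempfile.TemporaryFile(mode="w+")
        self.proc = subprocess.Popen(self.cmd, stdout=self.out)
        return "started"

    def collect(self):
        """Return the finished check's output, or None if nothing is ready."""
        if self.proc is None or self.proc.poll() is None:
            return None
        self.out.seek(0)
        text = self.out.read().strip()
        self.out.close()
        self.proc = None
        return text
```

The main loop would call start() on its schedule and collect() whenever convenient. The same "skip if busy" test would also cover the clash between a scheduled run and a job start/end run.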
I'm currently working on a brand new framework and implementation for node health checks here at LBL, so related topics are of particular interest to me right now. Seems to me like the simplest solution to this problem is to make sure your node health check script doesn't hang. There are multiple facets to this approach: (1) fork only when absolutely necessary, and (2) set an alarm on the script that will kill it after a relatively brief timeout. You've already pointed out just a few of the concurrency challenges associated with trying to background the node health check. Brings to mind the old adage about an ounce of prevention.... :-)
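For illustration, a minimal sketch of point (2) in Python. It uses the timeout argument of subprocess.run in place of a literal alarm signal, which has the same effect of killing the child once the limit passes. The function name and the 10 second default are assumptions, not anything from an existing health check framework.

```python
import subprocess


def run_check(cmd, timeout_s=10.0):
    """Run one check command; return (ok, message). Never waits past timeout_s."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False, "ERROR check timed out after %gs: %s" % (
            timeout_s, " ".join(cmd))
    if result.returncode != 0:
        return False, "ERROR check failed (rc=%d): %s" % (
            result.returncode, " ".join(cmd))
    return True, result.stdout.strip()
```

One caveat: only the direct child is killed. If the check spawns further processes that keep the output pipe open, those would need separate handling (for example, a process group).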
Hi Michael, we took a third approach to this: we now run our health check scripts from cron every 15 minutes, and they write their output to a local file (in /etc). The script that the compute nodes run now only outputs the content of those files (and raises an error if they're not there!). Works well for us now.
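A rough sketch of what the reporting side of that arrangement could look like, written in Python for illustration (the actual scripts are not shown in this report). The function name and path argument are placeholders. The staleness check is an extra assumption, added so that a hung cron job does not leave an old result being reported forever; the 30 minute default is simply two cron intervals.

```python
import os
import time


def report_cached_result(path, max_age_s=1800.0):
    """Return the cached check output, or an ERROR line if it is unusable."""
    try:
        age = time.time() - os.path.getmtime(path)
        with open(path) as fh:
            text = fh.read().strip()
    except OSError:
        # File missing or unreadable: the cron job has not produced a result.
        return "ERROR healthcheck result file missing: %s" % path
    if age > max_age_s:
        # Assumed extra check: the result is older than the allowed age.
        return "ERROR healthcheck result file stale (%d s old): %s" % (age, path)
    return text
```

Because this only reads a local file, the script that pbs_mom runs returns almost immediately, even if the check run from cron has hung.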
Agree this should be handled by the healthcheck scripts, not pbs_mom. Closing as resolved, invalid.