[torquedev] [Bug 124] New: pbs_mom healthcheck scripts should run from a forked process
bugzilla-daemon at supercluster.org
Wed Apr 20 23:53:36 MDT 2011
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=124
Summary: pbs_mom healthcheck scripts should run from a forked process
Product: TORQUE
Version: 2.4.x
Platform: Other
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: pbs_mom
AssignedTo: knielson at adaptivecomputing.com
ReportedBy: chris at csamuel.org
CC: torquedev at supercluster.org
Estimated Hours: 0.0
Healthcheck scripts can hang on some systems. For instance, on our IBM
iDataPlex dx360 M2s the suggested way to check for memory errors is to
interrogate the IPMI event log with ipmitool. Unfortunately, it seems to be
quite common for this process to hang. When it does, the pbs_mom process
blocks and you lose jobs, because communication between the sisters times
out and the job is killed.
Ideally, pbs_mom should run its healthcheck scripts in a separate process
forked from the original pbs_mom to avoid such problems. If a scheduled
healthcheck process were still running when it was time to run another
one, pbs_mom could flag the problem with something like:
ERROR healthcheck script hung?
It would also need a way to handle clashes with scripts run at job start
or end time (perhaps just skipping the new run if the clash is between an
event-driven run and a scheduled one, or between two event-driven ones).
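The behaviour proposed above could look roughly like the sketch below. It is
written in Python purely for illustration; pbs_mom itself is C and none of
this is TORQUE code. The class name HealthCheckRunner, its methods and the
"started"/"skipped" return values are hypothetical. Only the ERROR message
is taken from the suggestion above.

```python
import subprocess
import sys


class HealthCheckRunner:
    """Illustrative only: run a health check in a child process so the
    caller never blocks on it. All names here are hypothetical."""

    def __init__(self, command):
        self.command = command            # argv list for the health check script
        self.child = None                 # Popen handle of the check in flight
        self.child_event_driven = False   # was the in-flight check event driven?

    def start(self, event_driven=False):
        """Try to launch a check and report what happened as a string."""
        if self.child is not None and self.child.poll() is None:
            # The previous check has not exited yet.
            if event_driven or self.child_event_driven:
                # Clash involving a job start/end run: just skip this one.
                return "skipped"
            # Two scheduled runs overlap: the earlier one is probably stuck.
            return "ERROR healthcheck script hung?"
        # Popen returns immediately; the script runs in its own process.
        self.child = subprocess.Popen(self.command)
        self.child_event_driven = event_driven
        return "started"

    def reap(self):
        """Non-blocking: return the exit status if the check has finished,
        otherwise None."""
        if self.child is None:
            return None
        status = self.child.poll()
        if status is not None:
            self.child = None
        return status
```

A main loop would call start() at each scheduled interval and at job start
or end, and call reap() on each pass to collect the result, so a hung script
only ever ties up the child process.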
As a workaround we are going to look at different ways of running our
healthcheck scripts to mitigate this, but it'd be really nice for pbs_mom
to do the right thing in the first place!
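One possible shape for such a workaround, offered only as an assumption
since the report does not say which approach was chosen, is to have the
healthcheck script run the hang-prone command under a timeout so that the
script itself always returns. The helper name run_with_timeout and its
messages are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command and return (ok, message).

    Hypothetical helper: if the command does not finish in time,
    subprocess.run kills the child and raises TimeoutExpired, so the
    calling healthcheck script cannot block indefinitely.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False, "ERROR healthcheck command timed out"
    if result.returncode != 0:
        return False, "ERROR healthcheck command exited with status %d" % result.returncode
    return True, result.stdout
```

A healthcheck script could wrap its ipmitool invocation in a call like this
and print the returned message on failure, with the timeout chosen to be
well below the interval between checks.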