[torquedev] [Bug 124] New: pbs_mom healthcheck scripts should run from a forked process

bugzilla-daemon at supercluster.org
Wed Apr 20 23:53:36 MDT 2011


           Summary: pbs_mom healthcheck scripts should run from a forked
                    process
           Product: TORQUE
           Version: 2.4.x
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_mom
        AssignedTo: knielson at adaptivecomputing.com
        ReportedBy: chris at csamuel.org
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

It is quite possible for the healthcheck scripts on some systems to hang.

For instance, on our IBM iDataPlex dx360 M2s the suggested way to check for
memory errors is to interrogate the IPMI event log with ipmitool. Unfortunately
it seems to be quite common for ipmitool to hang, and when it does the pbs_mom
process blocks and you lose jobs, because communication between the sisters
times out and the job is killed.

Ideally pbs_mom should run its healthcheck scripts in a separate process
forked from the original pbs_mom, so that a hung script cannot block the
daemon.  If a scheduled healthcheck process were still running when it was
time to run the next one, pbs_mom could flag the problem with something like:

ERROR healthcheck script hung?

pbs_mom would also need a way to handle clashes with scripts running at job
start or end time (perhaps just skipping the new run if the clash is between
an event-driven run and a scheduled one, or between two event-driven ones).
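
The behaviour proposed above can be sketched roughly as follows. This is an
illustration in Python, not TORQUE code: the HealthCheckRunner class, its
method names and the "skipped" / "started" status strings are all hypothetical,
and pbs_mom itself would do the equivalent in C with fork() and a non-blocking
waitpid(). The child is started without waiting for it, a scheduled run that
finds the previous scheduled run still alive reports the error string suggested
above, and any clash involving an event-driven run is simply skipped.

```python
import subprocess
import sys


class HealthCheckRunner:
    """Hypothetical helper: runs a health check in a child process
    without ever blocking the caller."""

    def __init__(self, command):
        self.command = command              # argv list for the health check
        self.child = None                   # Popen handle of the latest run
        self.running_event_driven = False   # kind of the latest run

    def start(self, event_driven=False):
        """Try to start a run and return a status string."""
        if self.child is not None and self.child.poll() is None:
            # The previous run has not exited yet.
            if event_driven or self.running_event_driven:
                # Clash involving a job start/end run: just skip this one.
                return "skipped"
            # Two scheduled runs overlap: the earlier one is probably hung.
            return "ERROR healthcheck script hung?"
        self.child = subprocess.Popen(self.command)  # returns immediately
        self.running_event_driven = event_driven
        return "started"

    def result(self):
        """Exit code of the latest run, or None if still running or never run."""
        if self.child is None:
            return None
        return self.child.poll()


if __name__ == "__main__":
    # Stand-in for a health check that hangs (sleeps for 30 seconds).
    runner = HealthCheckRunner(
        [sys.executable, "-c", "import time; time.sleep(30)"])
    print(runner.start())                    # started
    print(runner.start())                    # ERROR healthcheck script hung?
    print(runner.start(event_driven=True))   # skipped
    runner.child.kill()                      # clean up the stand-in child
    runner.child.wait()
```

The point of the design is that the caller only ever polls the child, so a
hung script costs one stuck child process rather than a stuck daemon.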

As a workaround we are going to look at different ways of running our
healthcheck scripts to mitigate this, but it'd be really nice for pbs_mom to
do the right thing in the first place!
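
One possible script-side mitigation, shown here purely as an assumption (the
report does not say which approach was eventually chosen), is to run the slow
command under a timeout so that the healthcheck script itself always returns.
The run_with_timeout helper and its error message are hypothetical names; the
sketch relies only on the timeout argument of Python's subprocess.run, which
kills the child when the limit expires.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run command; return (exit_code, stdout), or (None, message) on timeout."""
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None, "ERROR healthcheck command timed out"
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in for a command that hangs; a real script might instead pass
    # something like ["ipmitool", "sel", "list"] and a site-specific limit.
    code, output = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], 1)
    print(code, output)
```

This only bounds how long pbs_mom is blocked; it does not remove the blocking
call, which is why the forked approach requested above is still preferable.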

Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
