[torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2

Gareth.Williams at csiro.au
Tue Aug 23 07:31:50 MDT 2011


We are running TORQUE 3.0.2 on a cluster (not a single-image system) and recently had a shared filesystem failure, which caused jobs to crash and pbs_mom to fail. Note that the shared filesystem hosts the pbs_mom binaries and the prologue/epilogue/health_check scripts, but /var/spool/torque is local.

When trying to start pbs_mom on nodes with failed jobs, we had a segfault:
Aug 23 18:10:58 n001 kernel: [180378.620287] pbs_mom[10927]: segfault at 0 ip 000000000041c7ca sp 00007fffc9806d50 error 4 in pbs_mom[400000+53000]
With the last entry in the mom_logs (different node):
08/23/2011 23:02:24;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
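
For reference, the kernel segfault line above can be decoded mechanically. The following is only a minimal sketch, assuming the usual x86 Linux "segfault at ... ip ... sp ... error ... in ..." message layout and the standard x86 page-fault error-code bits; the function name parse_segfault and the dictionary keys are made up for illustration and are not part of TORQUE.

```python
import re

# Kernel message layout assumed here (x86 Linux):
#   prog[pid]: segfault at ADDR ip IP sp SP error CODE in OBJ[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"))
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
        # x86 page-fault error-code bits.
        "page_present": bool(err & 1),  # 0 means the page was not mapped
        "write": bool(err & 2),         # 0 means the access was a read
        "user_mode": bool(err & 4),     # 1 means the fault came from user space
    }


LINE = ("Aug 23 18:10:58 n001 kernel: [180378.620287] pbs_mom[10927]: "
        "segfault at 0 ip 000000000041c7ca sp 00007fffc9806d50 "
        "error 4 in pbs_mom[400000+53000]")

info = parse_segfault(LINE)
print(hex(info["offset"]), info["fault_addr"], info["write"], info["user_mode"])
```

Read this way, the message describes a user-mode read of address 0 (a NULL pointer dereference) at offset 0x1c7ca into the pbs_mom mapping. If the binary carries debug symbols, something like `addr2line -e pbs_mom 0x41c7ca` may map the instruction pointer to a source line; that depends on how pbs_mom was built.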

If we purge the jobs and remove the job entries from the compute node, then pbs_mom starts OK.

Before I go investigating further, is this a problem that has been identified (and fixed) in 2.x?

Gareth

