[torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2
Gareth.Williams at csiro.au
Gareth.Williams at csiro.au
Tue Aug 23 07:31:50 MDT 2011
We are running 3.0.2 on a cluster (not a single image system) and recently had a shared filesystem failure which caused jobs to crash and pbs_mom failures. Note that the shared filesystem hosts the pbs_mom binaries and prologue/epilogue/health_check but /var/spool/torque is local.
When trying to start pbs_mom on nodes with failed jobs, we had a segfault:
Aug 23 18:10:58 n001 kernel: [180378.620287] pbs_mom[10927]: segfault at 0 ip 000000000041c7ca sp 00007fffc9806d50 error 4 in pbs_mom[400000+53000]
With the last entry in the mom_logs (different node):
08/23/2011 23:02:24;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
If we purge they jobs and remove the job entries from the compute node, then the pbs_mom starts OK.
Before I go investigating further, is this a problem that has been identified (and fixed) in 2.x?
Gareth
More information about the torqueusers
mailing list