[torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 06:26:36 MDT 2011



----- Original Message -----
> From: "Gareth Williams" <Gareth.Williams at csiro.au>
> To: torqueusers at supercluster.org
> Sent: Tuesday, August 23, 2011 7:31:50 AM
> Subject: [torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2
> We are running 3.0.2 on a cluster (not a single image system) and
> recently had a shared filesystem failure which caused jobs to crash
> and pbs_mom failures. Note that the shared filesystem hosts the
> pbs_mom binaries and prologue/epilogue/health_check but
> /var/spool/torque is local.
> 
> When trying to start pbs_mom on nodes with failed jobs, we had a
> segfault:
> Aug 23 18:10:58 n001 kernel: [180378.620287] pbs_mom[10927]: segfault
> at 0 ip 000000000041c7ca sp 00007fffc9806d50 error 4 in
> pbs_mom[400000+53000]
> With the last entry in the mom_logs (different node):
> 08/23/2011 23:02:24;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 
> If we purge the jobs and remove the job entries from the compute
> node, then pbs_mom starts OK.
> 
> Before I go investigating further, is this a problem that has been
> identified (and fixed) in 2.x?
> 
> Gareth

Gareth,

Are you using $thread_unlink_calls as a configuration option on the MOMs?
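
For reference, a minimal sketch of how that option would appear if it is set. This assumes the MOM config file is at the usual location under the local spool directory, /var/spool/torque/mom_priv/config, and that the option is enabled; your path and value may differ:

```
# /var/spool/torque/mom_priv/config (assumed location)
$thread_unlink_calls true
```

If no such line is present in that file on the affected nodes, the option is not being set there.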

Ken

