[torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2

Gareth.Williams at csiro.au Gareth.Williams at csiro.au
Wed Aug 24 17:57:40 MDT 2011


> -----Original Message-----
> From: Ken Nielson [mailto:knielson at adaptivecomputing.com]
> Sent: Wednesday, 24 August 2011 10:27 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] segfault pbs_mom cleaning up stale jobs
> v3.0.2
> 
> ----- Original Message -----
> > From: "Gareth Williams" <Gareth.Williams at csiro.au>
> > To: torqueusers at supercluster.org
> > Sent: Tuesday, August 23, 2011 7:31:50 AM
> > Subject: [torqueusers] segfault pbs_mom cleaning up stale jobs v3.0.2
> > We are running 3.0.2 on a cluster (not a single image system) and
> > recently had a shared filesystem failure which caused jobs to crash
> > and pbs_mom failures. Note that the shared filesystem hosts the
> > pbs_mom binaries and prologue/epilogue/health_check but
> > /var/spool/torque is local.
> >
> > When trying to start pbs_mom on nodes with failed jobs, we had a
> > segfault:
> > Aug 23 18:10:58 n001 kernel: [180378.620287] pbs_mom[10927]: segfault
> > at 0 ip 000000000041c7ca sp 00007fffc9806d50 error 4 in
> > pbs_mom[400000+53000]
> > With the last entry in the mom_logs (different node):
> > 08/23/2011 23:02:24;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> >
> > If we purge the jobs and remove the job entries from the compute
> > node, then pbs_mom starts OK.
> >
> > Before I go investigating further, is this a problem that has been
> > identified (and fixed) in 2.x?
> >
> > Gareth
> 
> Gareth,
> 
> Are you using the $thread_unlink_calls as a configuration option on the
> MOMs?
> 
> Ken

Hi Ken,

No.  Just:
arch x86_64
opsys sles11
$usecp *:/data /data
$usecp *:/home /home
$node_check_script /var/spool/torque/mom_priv/node_health.sh
$node_check_interval 4
$remote_reconfig 1

Gareth
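For reference, the cleanup step described above (purging the jobs and removing their entries from the compute node) can be sketched roughly as below. This is a hypothetical helper, not part of TORQUE: the function name is invented, and only the `.JB` job-file suffix under mom_priv/jobs is assumed from standard pbs_mom layout; run any `qdel -p` on the server host, not the node.

```shell
# Hypothetical helper: remove the on-disk state of stale jobs from a
# mom_priv directory so pbs_mom can start without trying to recover them.
# pbs_mom keeps per-job files (e.g. <jobid>.JB) under mom_priv/jobs.
purge_stale_jobs() {
    mom_priv="$1"
    for jb in "$mom_priv"/jobs/*.JB; do
        [ -e "$jb" ] || continue          # glob matched nothing
        jobid=$(basename "$jb" .JB)
        # qdel -p "$jobid"  # purge the job from pbs_server first (server side)
        rm -f "$mom_priv/jobs/$jobid".*   # drop all state files for this job
    done
}

# Example (point it at a scratch copy, never the live spool, until verified):
# purge_stale_jobs /var/spool/torque/mom_priv
```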
