[torqueusers] clearing corrupt job records

Gareth.Williams at csiro.au
Mon Aug 12 18:19:30 MDT 2013


> -----Original Message-----
> From: Roman Baranowski [mailto:roman at chem.ubc.ca]
> Sent: Friday, 9 August 2013 4:38 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] clearing corrupt job records
> 
> 
>  	Dear Gareth,
> 
> The issue here may be pbs_server related (restarting pbs_mom does not
> fix it). If you cannot restart pbs_server, you can try removing the
> node with qmgr and then creating it again.
> 
>  	All the best
>  	Roman

Removing the node and creating it again worked (apart from a nasty glitch where qmgr barfed and killed pbs_server - possibly a missing trailing newline in the qmgr input).
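
For the archives, the commands were roughly as follows - a sketch rather than a transcript, with the node name and np taken from the pbsnodes output quoted below; any node properties would need re-adding by hand:

    # delete the node carrying the corrupt job records, then recreate it
    qmgr -c 'delete node n026'
    qmgr -c 'create node n026 np=12'
    # keep it offline until it checks out
    pbsnodes -o n026

Running each command via -c also avoids feeding qmgr a script on stdin, which is where we hit the crash.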

Thanks!

Gareth

BTW. Service restart and node reboot did not help - I could have been clearer on that in my initial post.
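For completeness, the restarts that did not help were along these lines (assuming the stock init scripts on SLES 11 - paths vary with packaging):

    # on the node itself (n026)
    /etc/init.d/pbs_mom restart
    # and eventually a full reboot of the node - no change either way
    reboot
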

> 
> 
> On Thu, 8 Aug 2013, Ken Nielson wrote:
> 
> >
> > On Wed, Aug 7, 2013 at 8:51 PM, <Gareth.Williams at csiro.au> wrote:
> >       Hi all,
> >
> >       We have a node that is reporting the existence of a couple of
> >       jobs with nonsense info:
> >       wil240@burnet-login:~> pbsnodes -a n026
> >       n026
> >            state = offline
> >            np = 12
> >            ntype = cluster
> >            jobs = 2/, 5/??????k
> >            status = rectime=1375929927,varattr=,jobs=,state=free,size=140006956kb:144492840kb,netload=1245190319,gres=,loadave=0.00,ncpus=12,physmem=99195396kb,availmem=99048544kb,totmem=101299868kb,idletime=60100,nusers=0,nsessions=0,uname=Linux n026 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64,opsys=sles11,arch=x86_64
> >            mom_service_port = 15002
> >            mom_manager_port = 15003
> >            gpus = 0
> >
> >       We've not been able to tell where this is coming from. The
> >       pbs_mom and node have been restarted with no change. There is
> >       nothing in /var/spool/torque/mom_priv/jobs
> >
> >       When new jobs are sent to the node they fail, so we've taken it
> >       offline and the problem is not currently critical.
> >
> >       Does anyone know how to recover from this state?
> >
> >       The cluster is running version: 3.0.6
> >
> >       Regards,
> >
> >       Gareth Williams Ph.D.
> >       Outreach and Science Data Manager
> >       eResearch IM&T Advanced Scientific Computing
> >       CSIRO
> >       E Gareth.Williams at csiro.au T +61 3 8601 3804
> >       www.csiro.au | https://wiki.csiro.au/display/ASC/
> >
> >       _______________________________________________
> >       torqueusers mailing list
> >       torqueusers at supercluster.org
> >       http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> > Gareth,
> >
> > Since there are no jobs in the jobs directory, it seems it would be
> > safe to restart the mom. Have you tried that?
> >
> > --
> > Ken Nielson
> > +1 801.717.3700 office +1 801.717.3738 fax
> > 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> > www.adaptivecomputing.com
