[torqueusers] clearing corrupt job records

Ken Nielson knielson at adaptivecomputing.com
Thu Aug 8 08:20:53 MDT 2013


On Wed, Aug 7, 2013 at 8:51 PM, <Gareth.Williams at csiro.au> wrote:

> Hi all,
>
> We have a node that is reporting the existence of a couple of jobs with
> nonsense info:
> wil240 at burnet-login:~> pbsnodes -a n026
> n026
>      state = offline
>      np = 12
>      ntype = cluster
>      jobs = 2/, 5/��k
>      status = rectime=1375929927,varattr=,jobs=,state=free,size=140006956kb:144492840kb,netload=1245190319,gres=,loadave=0.00,ncpus=12,physmem=99195396kb,availmem=99048544kb,totmem=101299868kb,idletime=60100,nusers=0,nsessions=0,uname=Linux n026 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64,opsys=sles11,arch=x86_64
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
>
> We have not been able to tell where this is coming from. Both pbs_mom and
> the node itself have been restarted, with no change. There is nothing in
> /var/spool/torque/mom_priv/jobs.
>
> When new jobs are sent to the node they fail, so we've taken it offline and
> the problem is not currently critical.
>
> Does anyone know how to recover from this state?
>
> The cluster is running TORQUE version 3.0.6.
>
> Regards,
>
> Gareth Williams Ph.D.
> Outreach and Science Data Manager
> eResearch IM&T Advanced Scientific Computing
> CSIRO
> E Gareth.Williams at csiro.au T +61 3 8601 3804
> www.csiro.au | https://wiki.csiro.au/display/ASC/
>

Gareth,

Since there are no jobs in the jobs directory, it seems it would be safe to
restart the mom. Have you tried that?
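
To check whether the bogus entries are still there after a restart, the
"jobs" line of the pbsnodes output can be tested mechanically. The following
is only a minimal Python sketch. It assumes that a healthy entry looks like
slot/jobid.server (for example 0/1234.burnet-login), which is an assumption
about the output format and is not stated anywhere in this thread. The
function name malformed_job_entries and the sample job id are hypothetical.

```python
import re

# ASSUMPTION: a healthy entry is "<slot>/<numeric job id>.<server name>",
# optionally with an array index in brackets after the job id.
JOB_ENTRY = re.compile(r"\d+/\d+(\[\d*\])?\.[A-Za-z0-9_.-]+")


def malformed_job_entries(pbsnodes_text):
    """Return entries on 'jobs = ...' lines that do not match JOB_ENTRY."""
    bad = []
    for line in pbsnodes_text.splitlines():
        # Node attributes are printed as "key = value"; skip everything else.
        key, sep, value = line.strip().partition(" = ")
        if not sep or key != "jobs":
            continue
        for entry in value.split(","):
            entry = entry.strip()
            if entry and not JOB_ENTRY.fullmatch(entry):
                bad.append(entry)
    return bad


if __name__ == "__main__":
    # Shortened copy of the output quoted above (undecodable bytes omitted).
    sample = "n026\n     state = offline\n     np = 12\n     jobs = 2/, 5/k\n"
    print(malformed_job_entries(sample))
```

The function takes the text of a pbsnodes listing, so the output of
"pbsnodes -a n026" can be saved to a file and passed in. An empty list means
every entry on the jobs line matched the assumed pattern.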

-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com