[torqueusers] clearing corrupt job records

Roman Baranowski roman at chem.ubc.ca
Thu Aug 8 12:38:29 MDT 2013


 	Dear Gareth,

The problem here may be on the Torque server side, since restarting pbs_mom 
does not fix it. If you cannot restart pbs_server, you can try removing 
the node with qmgr and then creating it again.
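
A minimal sketch of that qmgr sequence, assuming the node name (n026) and 
np value (12) from the pbsnodes output quoted below. The run and 
recreate_node helpers and the DRY_RUN variable are hypothetical names used 
only for illustration, not part of Torque. By default the script only 
prints what it would do; check the exact qmgr syntax against the 
documentation for your Torque version before running it for real.

```shell
#!/bin/sh
# Sketch only: drop and re-create a node entry in pbs_server via qmgr.
# DRY_RUN is a hypothetical switch of this script, not a Torque setting.

run() {
    # Print the command unless DRY_RUN is explicitly set to something other than 1.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recreate_node() {
    node="$1"   # node name as known to pbs_server
    np="$2"     # number of processors to assign to the node

    run qmgr -c "delete node $node"        # remove the stale node record
    run qmgr -c "create node $node"        # add the node back
    run qmgr -c "set node $node np = $np"  # restore the processor count
    run pbsnodes -a "$node"                # check whether the bogus jobs are gone
}

# Dry run with the values from the quoted pbsnodes output.
recreate_node n026 12
```

To apply the changes, run the script on the pbs_server host as a Torque 
manager with DRY_RUN=0 set in the environment.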

 	All the best
 	Roman


On Thu, 8 Aug 2013, Ken Nielson wrote:

> 
> On Wed, Aug 7, 2013 at 8:51 PM, <Gareth.Williams at csiro.au> wrote:
>       Hi all,
>
>       We have a node that is reporting the existence of a couple of jobs with nonsense info:
>       wil240 at burnet-login:~> pbsnodes -a n026
>       n026
>            state = offline
>            np = 12
>            ntype = cluster
>            jobs = 2/, 5/??????k
>            status = rectime=1375929927,varattr=,jobs=,state=free,size=140006956kb:144492840kb,netload=1245190319,gres=,loadave=0.00,ncpus=12,physmem=99195396kb,availmem=99048544kb,totmem=101299868kb,idletime=60100,nusers=0,nsessions=0,uname=Linux n026 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64,opsys=sles11,arch=x86_64
>            mom_service_port = 15002
>            mom_manager_port = 15003
>            gpus = 0
>
>       We've not been able to tell where this is coming from. Both the pbs_mom and the node itself have been restarted, with no change. There is nothing in
>       /var/spool/torque/mom_priv/jobs.
>
>       When new jobs are sent to the node they fail, so we've taken it offline and the problem is not currently critical.
>
>       Does anyone know how to recover from this state?
>
>       The cluster is running version: 3.0.6
>
>       Regards,
>
>       Gareth Williams Ph.D.
>       Outreach and Science Data Manager
>       eResearch IM&T Advanced Scientific Computing
>       CSIRO
>       E Gareth.Williams at csiro.au T +61 3 8601 3804
>       www.csiro.au | https://wiki.csiro.au/display/ASC/
>
>       _______________________________________________
>       torqueusers mailing list
>       torqueusers at supercluster.org
>       http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 
> Gareth,
> 
> Since there are no jobs in the jobs directory it seems it would be safe to restart the mom. Have you tried that?
> 
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
> 
> 
>

