[torqueusers] clearing corrupt job records
roman at chem.ubc.ca
Thu Aug 8 12:38:29 MDT 2013
The issue here may be on the TORQUE server side, since restarting pbs_mom
does not fix it. If you cannot restart pbs_server, you can try deleting
the node with qmgr and then creating it again.
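A minimal sketch of the delete-and-recreate step, assuming the node name (n026) and np value (12) from the pbsnodes output quoted below. The function name recreate_node and the DRY_RUN switch are made up for this example; by default the script only prints the qmgr commands so they can be reviewed first. Setting np in a separate "set node" command is one way to do it; check the qmgr man page for your version.

```shell
#!/bin/sh
# Sketch only: drop a node from pbs_server and add it back with qmgr.
# recreate_node and DRY_RUN are hypothetical names used for this example.
# With DRY_RUN unset or set to 1 the commands are printed, not executed.

recreate_node() {
    node=$1   # node name as known to pbs_server, e.g. n026
    np=$2     # number of virtual processors to set on the new entry

    for cmd in "delete node $node" \
               "create node $node" \
               "set node $node np = $np"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Show what would be run.
            printf 'qmgr -c "%s"\n' "$cmd"
        else
            # Run against the live server; stop at the first failure.
            qmgr -c "$cmd" || return 1
        fi
    done
}

recreate_node n026 12
```

Run it with DRY_RUN=0 as a TORQUE manager to actually apply the changes, then check the result with pbsnodes -a n026. Other attributes of the old entry (for example node properties) are not carried over by this sketch and would have to be set again.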
All the best
On Thu, 8 Aug 2013, Ken Nielson wrote:
> On Wed, Aug 7, 2013 at 8:51 PM, <Gareth.Williams at csiro.au> wrote:
> Hi all,
> We have a node that is reporting the existence of a couple of jobs with nonsense info:
> wil240 at burnet-login:~> pbsnodes -a n026
> state = offline
> np = 12
> ntype = cluster
> jobs = 2/, 5/??????k
> status = rectime=1375929927,varattr=,jobs=,state=free,size=140006956kb:144492840kb,netload=1245190319,gres=,loadave=0.00,ncpus=12,physmem=99195396kb,availmem=99048544kb,totmem=101299868kb,idletime=60100,nusers=0,nsessions=0,uname=Linux n026 126.96.36.199-0.7-default #1 SMP 2012-07-13 15:50:56 +0200
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
> We've not been able to tell where this is coming from. The pbs_mom and node have been restarted with no change. There is nothing in
> When new jobs are sent to the node they fail, so we've taken it offline and the problem is not currently critical.
> Does anyone know how to recover from this state?
> The cluster is running version: 3.0.6
> Gareth Williams Ph.D.
> Outreach and Science Data Manager
> eResearch IM&T Advanced Scientific Computing
> E Gareth.Williams at csiro.au T +61 3 8601 3804
> www.csiro.au | https://wiki.csiro.au/display/ASC/
> Since there are no jobs in the jobs directory it seems it would be safe to restart the mom. Have you tried that?
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300 Provo, UT 84606