[torqueusers] clearing corrupt job records

Gareth.Williams at csiro.au Gareth.Williams at csiro.au
Wed Aug 7 20:51:43 MDT 2013


Hi all,

We have a node that is reporting the existence of a couple of jobs with nonsense info:
wil240 at burnet-login:~> pbsnodes -a n026
n026
     state = offline
     np = 12
     ntype = cluster
     jobs = 2/, 5/��k
     status = rectime=1375929927,varattr=,jobs=,state=free,size=140006956kb:144492840kb,netload=1245190319,gres=,loadave=0.00,ncpus=12,physmem=99195396kb,availmem=99048544kb,totmem=101299868kb,idletime=60100,nusers=0,nsessions=0,uname=Linux n026 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64,opsys=sles11,arch=x86_64
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

We've not been able to tell where this is coming from. The pbs_mom and the node have been restarted with no change, and there is nothing in /var/spool/torque/mom_priv/jobs.
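For reference, the checks described above can be sketched roughly as follows (a sketch only, assuming the default TORQUE_HOME of /var/spool/torque and an init script named pbs_mom; paths and service names may differ on SLES):

```shell
# Inspect the node's reported state and job list from the server
pbsnodes -a n026

# On n026 itself: restart the mom daemon
/etc/init.d/pbs_mom restart

# Check for leftover job files on the mom side
ls -la /var/spool/torque/mom_priv/jobs/

# momctl can query the mom's own view of its jobs (run on the node
# or point -h at it); this may show where the bogus entries live
momctl -d 3 -h n026
```

The `momctl -d 3` diagnostic dump is a guess at something worth checking, not a confirmed fix.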

When new jobs are sent to the node they fail, so we've taken it offline and the problem is not currently critical.

Does anyone know how to recover from this state?

The cluster is running Torque version 3.0.6.

Regards,

Gareth Williams Ph.D.
Outreach and Science Data Manager
eResearch IM&T Advanced Scientific Computing
CSIRO
E Gareth.Williams at csiro.au T +61 3 8601 3804 
www.csiro.au | https://wiki.csiro.au/display/ASC/
