[torqueusers] clearing corrupt job records
knielson at adaptivecomputing.com
Thu Aug 8 08:20:53 MDT 2013
On Wed, Aug 7, 2013 at 8:51 PM, <Gareth.Williams at csiro.au> wrote:
> Hi all,
> We have a node that is reporting the existence of a couple of jobs with
> nonsense info:
> wil240 at burnet-login:~> pbsnodes -a n026
> state = offline
> np = 12
> ntype = cluster
> jobs = 2/, 5/��k
> status =
> n026 126.96.36.199-0.7-default #1 SMP 2012-07-13 15:50:56 +0200
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
> We've not been able to tell where this is coming from. The pbs_mom and
> node have been restarted with no change. There is nothing in
> When new jobs are sent to the node they fail so we've taken it offline and
> the problem is not currently critical.
> Does anyone know how to recover from this state?
> The cluster is running version: 3.0.6
> Gareth Williams Ph.D.
> Outreach and Science Data Manager
> eResearch IM&T Advanced Scientific Computing
> E Gareth.Williams at csiro.au T +61 3 8601 3804
> www.csiro.au | https://wiki.csiro.au/display/ASC/
Since there are no jobs in the jobs directory, it seems it would be safe to
restart the mom. Have you tried that?
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
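
To confirm whether the bogus entries are still listed after a restart, the
"jobs" field of the pbsnodes output can be checked mechanically. The following
is only a minimal sketch in Python, not part of TORQUE: the function name
suspect_job_entries and the ENTRY pattern are hypothetical, and the pattern
assumes that a well-formed entry looks like "<slot>/<numeric id>.<server>"
(for example "0/1234.burnet-login"). Adjust it if your job IDs differ.

```python
import re

# Assumed shape of a healthy entry: "<slot>/<job id>", e.g. "0/1234.burnet-login".
# This pattern is an illustration only, not TORQUE's own definition of a job ID.
ENTRY = re.compile(r"\d+/\d+(?:\[\d*\])?\.[A-Za-z0-9._-]+")


def suspect_job_entries(pbsnodes_text):
    """Return entries on 'jobs = ...' lines that do not look like slot/jobid."""
    suspects = []
    for line in pbsnodes_text.splitlines():
        key, sep, value = line.partition("=")
        # Only the "jobs" attribute is of interest; skip every other line.
        if not sep or key.strip() != "jobs":
            continue
        for entry in value.split(","):
            entry = entry.strip()
            if entry and not ENTRY.fullmatch(entry):
                suspects.append(entry)
    return suspects


if __name__ == "__main__":
    # Text as it would be captured from "pbsnodes -a n026" in the report above;
    # "\ufffd" stands for the undecodable bytes shown there.
    sample = "state = offline\nnp = 12\njobs = 2/, 5/\ufffd\ufffdk\n"
    print(suspect_job_entries(sample))
```

For the output quoted in the report, both entries on the jobs line are flagged,
since neither carries a recognisable job ID. An empty list after restarting
pbs_mom would suggest the stale records are gone.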