[torqueusers] big .TK file

Ken Nielson knielson at adaptivecomputing.com
Wed Nov 28 12:06:39 MST 2012


Gareth,

The .TK file is a leftover task file. I would delete it.
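
To confirm that the file really is sparse before removing it, one option is to compare its apparent size (what ls -l shows) with the space actually allocated (roughly what du counts). The following is only a minimal sketch: sparse_report is a hypothetical helper name, not part of TORQUE, and it assumes a Linux system where st_blocks is counted in 512-byte units.

```python
import os


def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes, looks_sparse) for a file.

    Hypothetical helper for illustration; not part of TORQUE.
    """
    st = os.stat(path)
    apparent = st.st_size            # logical size, as shown by `ls -l`
    allocated = st.st_blocks * 512   # assumes 512-byte units, as on Linux
    # A file whose allocated space is smaller than its logical size
    # contains holes, i.e. it is sparse.
    return apparent, allocated, allocated < apparent
```

Calling sparse_report("/var/spool/torque/mom_priv/jobs/.TK") as root should then show whether the allocated space is far below the apparent size.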

Ken

On Mon, Nov 26, 2012 at 4:20 PM, <Gareth.Williams at csiro.au> wrote:

> Hi All,
>
> In our torque instance on SGI UV: pbs_version = 3.0.4-snap.201201051014
>
> We recently noticed this:
>
> > ls -al /var/spool/torque/mom_priv/jobs/.TK
> -rw------- 1 root root 680505410608 Nov 20 20:06 /var/spool/torque/mom_priv/jobs/.TK
>
> The file must be really sparse, as du reports very little disk usage for it.
>
> The pbs_mom was restarted an hour or so before the timestamp on this file
> and there are a few errors logged like:
>
> 11/20/2012 19:12:00;0001;   pbs_mom;Job;job_nodes;job: 30066.cherax.hpsc.csiro.au numnodes=1 numvnod=1
> 11/20/2012 19:12:00;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
> 11/20/2012 19:12:00;0080;   pbs_mom;Job;init_abort_jobs;task recovery failed for job 30066.cherax.hpsc.csiro.au, rc=-1
> 11/20/2012 19:12:00;0080;   pbs_mom;Job;init_abort_jobs;attempting to recover job 30066.cherax.hpsc.csiro.au in state RUNNING
>
> 11/20/2012 19:12:03;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in open_std_file, cannot open/create stdout/stderr file '/var/spool/torque/spool/30066.cherax.hpsc.csiro.au.ER' (mode: 2001, keeping: FALSE)
> 11/20/2012 19:12:03;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Inappropriate ioctl for device (25) in message_job, cannot open stderr file for job '30066.cherax.hpsc.csiro.au' (msg: '=>> PBS: job killed: walltime 8404959 exceeded limit 16200
> 11/20/2012 19:12:03;0008;   pbs_mom;Job;kill_job;examine_all_polled_jobs: sending signal 15, "TERM" to job 30066.cherax.hpsc.csiro.au, reason: job is over-limit-0
> 11/20/2012 19:12:03;0008;   pbs_mom;Job;29928.cherax.hpsc.csiro.au;walltime 8478981 exceeded limit 252000
>
> 11/20/2012 19:16:56;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running
>
> (This last one is happening quite a bit, maybe related to slow start of
> jobs.)
>
> Any ideas?
>
> Is the .TK file the result of some corruption? I think we will just delete it.
>
> Gareth
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>