[torqueusers] big .TK file

Gareth.Williams at csiro.au Gareth.Williams at csiro.au
Mon Nov 26 16:20:57 MST 2012


Hi All,



In our torque instance on SGI UV: pbs_version = 3.0.4-snap.201201051014

We recently noticed this:



> ls -al /var/spool/torque/mom_priv/jobs/.TK

-rw------- 1 root root 680505410608 Nov 20 20:06 /var/spool/torque/mom_priv/jobs/.TK

The file must be very sparse, as du reports that it takes up very little disk space.
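
For anyone wanting to confirm the sparseness directly, here is a minimal sketch that compares the apparent size (what ls shows) with the space actually allocated (roughly what du shows). The helper name sparse_report is made up for illustration, and it assumes a POSIX system where st_blocks is counted in 512-byte units:

```python
import os

def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    Hypothetical helper, not part of TORQUE. Assumes a POSIX system,
    where st_blocks is reported in 512-byte units.
    """
    st = os.stat(path)
    apparent = st.st_size            # logical length, as shown by ls -l
    allocated = st.st_blocks * 512   # blocks actually backed by storage
    return apparent, allocated
```

Calling sparse_report("/var/spool/torque/mom_priv/jobs/.TK") as root should return an allocated figure far below the apparent one if the file really is sparse.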

The pbs_mom was restarted an hour or so before the timestamp on this file and there are a few errors logged like:

11/20/2012 19:12:00;0001;   pbs_mom;Job;job_nodes;job: 30066.cherax.hpsc.csiro.au numnodes=1 numvnod=1
11/20/2012 19:12:00;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in task_recov, open of task file
11/20/2012 19:12:00;0080;   pbs_mom;Job;init_abort_jobs;task recovery failed for job 30066.cherax.hpsc.csiro.au, rc=-1
11/20/2012 19:12:00;0080;   pbs_mom;Job;init_abort_jobs;attempting to recover job 30066.cherax.hpsc.csiro.au in state RUNNING

11/20/2012 19:12:03;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in open_std_file, cannot open/create stdout/stderr file '/var/spool/torque/spool/30066.cherax.hpsc.csiro.au.ER' (mode: 2001, keeping: FALSE)
11/20/2012 19:12:03;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Inappropriate ioctl for device (25) in message_job, cannot open stderr file for job '30066.cherax.hpsc.csiro.au' (msg: '=>> PBS: job killed: walltime 8404959 exceeded limit 16200
11/20/2012 19:12:03;0008;   pbs_mom;Job;kill_job;examine_all_polled_jobs: sending signal 15, "TERM" to job 30066.cherax.hpsc.csiro.au, reason: job is over-limit-0
11/20/2012 19:12:03;0008;   pbs_mom;Job;29928.cherax.hpsc.csiro.au;walltime 8478981 exceeded limit 252000

11/20/2012 19:16:56;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0) in req_quejob, cannot queue new job, job exists and is running
(this last one is happening quite a bit - maybe related to slow start of jobs)
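
To put the over-limit messages above in perspective, here is a small sketch that pulls the two numbers out of a "walltime N exceeded limit M" log line and converts them to durations. It assumes both values are in seconds; the function name over_limit is made up for illustration and is not part of TORQUE:

```python
import re
from datetime import timedelta

# Matches the "walltime N exceeded limit M" text seen in the pbs_mom log.
OVER_LIMIT = re.compile(r"walltime (\d+) exceeded limit (\d+)")

def over_limit(line):
    """Return (used, limit) as timedeltas, or None if the line has no match.

    Assumption: both numbers in the log message are seconds.
    """
    m = OVER_LIMIT.search(line)
    if m is None:
        return None
    used, limit = (int(g) for g in m.groups())
    return timedelta(seconds=used), timedelta(seconds=limit)
```

Under that assumption, the reported walltimes of 8404959 and 8478981 work out to roughly 97 and 98 days, against limits of 4.5 and 70 hours.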


Any ideas?
Is the .TK file the result of some corruption? I think we will just delete it.

Gareth
