[torqueusers] jobs stuck in "W" state
gianfranco sciacca
gs at hep.ucl.ac.uk
Thu Feb 23 09:01:16 MST 2006
Hello torqueusers,
In one of our queues I currently have a number of jobs that qstat has
reported in "W" state for a couple of days. On closer examination,
qstat -f shows that each has an assigned execution node ("exec_host").
However, nothing appears in the mom logs of that execution node, and
there are no corresponding entries in its mom_priv/jobs directory. Using
tracejob on the server, I find that:
=====
gs> tracejob -n 3 142774.pc72
Job: 142774.pc72.hep.ucl.ac.uk
02/21/2006 03:25:09 S enqueuing into lcgatlas, state 1 hop 1
02/21/2006 03:25:09 S Job Queued at request of
                      atlas004 at pc90.hep.ucl.ac.uk, owner =
                      atlas004 at pc90.hep.ucl.ac.uk, job name = STDIN,
                      queue = lcgatlas
02/21/2006 03:25:11 L Queue job limit reached
... <same message repeated several times>
02/21/2006 12:29:29 L Queue job limit reached
02/23/2006 00:24:48 S Job Modified at request of
                      Scheduler at pc72.hep.ucl.ac.uk
02/23/2006 00:24:48 L Job Run
02/23/2006 00:24:48 S Job Run at request of
                      Scheduler at pc72.hep.ucl.ac.uk
=====
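For anyone wanting to pick out such jobs automatically, the qstat -f check described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the usual "Job Id:" header followed by indented "name = value" attribute lines, it ignores wrapped continuation lines, and the function names are hypothetical.

```python
def parse_qstat_full(text):
    """Split `qstat -f` style text into one dict of attributes per job."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            # A new job record starts here.
            current = {"Job Id": line.split(":", 1)[1].strip()}
            records.append(current)
        elif current is not None and " = " in line:
            # Indented attribute line, e.g. "    job_state = W".
            key, value = line.strip().split(" = ", 1)
            current[key] = value
    return records


def waiting_with_exec_host(records):
    """Return {job id: exec_host} for jobs in state W that have an exec_host."""
    return {
        r["Job Id"]: r["exec_host"]
        for r in records
        if r.get("job_state") == "W" and "exec_host" in r
    }
```

The text to parse would come from running qstat -f on the server; any job returned here is one that claims an execution node while still waiting.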
In the scheduler log, the job appears to be re-run roughly every half
hour:
=====
..... <snip>
02/23/2006 14:33:06;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:03:38;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:33:50;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
=====
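The half-hour pattern can be confirmed from the scheduler log with a short script. The following is a minimal sketch, assuming each record sits on a single line in the log file itself and uses the semicolon-separated layout shown above; the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def job_run_times(log_lines):
    """Collect the timestamps of "Job Run" records, grouped by job id."""
    runs = defaultdict(list)
    for line in log_lines:
        fields = [f.strip() for f in line.split(";")]
        # Expected fields: timestamp, event code, daemon, object type,
        # job id, message.
        if len(fields) < 6 or fields[3] != "Job" or fields[5] != "Job Run":
            continue
        stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
        runs[fields[4]].append(stamp)
    return runs


def run_intervals_minutes(times):
    """Return the gaps, in minutes, between consecutive run attempts."""
    ordered = sorted(times)
    return [
        round((later - earlier).total_seconds() / 60, 1)
        for earlier, later in zip(ordered, ordered[1:])
    ]
```

Fed the three records above, run_intervals_minutes gives gaps of 30.5 and 30.2 minutes for job 142774, which matches a scheduler retrying the same job on each cycle.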
Can anyone spot what's going on and suggest a possible cure?
Thanks,
gianfranco
--
Dr. Gianfranco Sciacca             Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy      Internal: 33044
University College London          D15 - Physics Building
London WC1E 6BT