[torqueusers] jobs stuck in "W" state

gianfranco sciacca gs at hep.ucl.ac.uk
Thu Feb 23 09:01:16 MST 2006


Hello torqueusers,
I currently have a number of jobs in one of our queues that qstat has
reported in the "W" state for a couple of days. On closer examination,
qstat -f shows that each has an execution node assigned ("exec_host").

However, nothing appears in the mom logs on that execution node, and there
are no corresponding entries in its mom_priv/jobs directory. Running
tracejob on the server, I find the following:

=====
gs> tracejob -n 3 142774.pc72
 
Job: 142774.pc72.hep.ucl.ac.uk
 
02/21/2006 03:25:09  S    enqueuing into lcgatlas, state 1 hop 1
02/21/2006 03:25:09  S    Job Queued at request of atlas004 at pc90.hep.ucl.ac.uk, owner = atlas004 at pc90.hep.ucl.ac.uk, job name = STDIN, queue = lcgatlas
02/21/2006 03:25:11  L    Queue job limit reached
... <same message repeated several times>
02/21/2006 12:29:29  L    Queue job limit reached
02/23/2006 00:24:48  S    Job Modified at request of Scheduler at pc72.hep.ucl.ac.uk
02/23/2006 00:24:48  L    Job Run
02/23/2006 00:24:48  S    Job Run at request of Scheduler at pc72.hep.ucl.ac.uk
=====

According to the scheduler log, the job is being re-run roughly every half
hour:

=====
..... <snip>
02/23/2006 14:33:06;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:03:38;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:33:50;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
=====
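
To put a number on the retry interval, the "Job Run" timestamps can be
pulled out of the scheduler log with a short script. The Python sketch
below is only an illustration: the function names are hypothetical, and it
assumes the semicolon-separated record layout seen in the excerpt above,
with any wrapped lines rejoined before parsing.

```python
from datetime import datetime

# Sample records copied from the scheduler log excerpt (wrapped lines rejoined).
SAMPLE_LOG = """\
02/23/2006 14:33:06;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:03:38;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
02/23/2006 15:33:50;0040; pbs_sched;Job;142774.pc72.hep.ucl.ac.uk;Job Run
"""


def run_times(log_text, job_id):
    """Return the timestamps of every 'Job Run' record for job_id."""
    times = []
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split(";")]
        # Assumed layout: timestamp;event code;daemon;class;job id;message
        if len(fields) < 6:
            continue
        if fields[4] == job_id and fields[5] == "Job Run":
            times.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    return times


def run_gaps_seconds(times):
    """Return the gaps, in seconds, between consecutive run attempts."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    attempts = run_times(SAMPLE_LOG, "142774.pc72.hep.ucl.ac.uk")
    for gap in run_gaps_seconds(attempts):
        print(f"{gap // 60} min {gap % 60} s between run attempts")
```

On the three records shown, the gaps come out at 30 min 32 s and
30 min 12 s, which is what "every half hour" refers to.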

Can anyone spot what's going on and suggest a possible cure?

Thanks,
gianfranco
-- 
Dr. Gianfranco Sciacca			Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy		Internal: 33044
University College London		D15 - Physics Building
London WC1E 6BT


