[torqueusers] qdel will not delete

Rahul Nabar rpnabar at gmail.com
Thu Dec 11 11:47:21 MST 2008


Every so often I have jobs that won't respond to qdel. Their
"REMAINING-time" in MAUI then becomes negative, which was initially
confusing since I thought it was a MAUI bug.

But the root cause seems to be that PBS will not obey the qdel on this
job, irrespective of whether I issue it as root or MAUI issues it.

I had one such job today and debugged it further. All the sub-nodes
seemed to be up, and the mom daemon on each of these nodes seemed to
be up and running.

The mom_log on the master node, though, was interesting. It had this snippet:

12/11/2008 11:47:38;0002;   pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR:    received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
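
For anyone who wants to scan a mom_log for the same symptom, here is a
minimal Python sketch. It assumes each log record sits on a single line
(unwrapped, unlike the mail above) with the six semicolon-separated fields
seen in the snippet. The function names and the field labels (timestamp,
code, daemon, kind, object, message) are my own and not anything TORQUE
defines.

```python
def parse_mom_log_line(line):
    """Split one mom_log record into its six semicolon-separated fields.

    Field labels are assumptions based on the snippet in this post.
    Returns None if the line does not have the expected shape.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, code, daemon, kind, obj, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "code": code,
        "daemon": daemon,
        "kind": kind,
        "object": obj,
        "message": message,
    }


def find_unknown_job_kills(lines):
    """Return (timestamp, job_id) for KILL_JOB requests the mom rejected
    because it did not know the job locally."""
    hits = []
    for line in lines:
        record = parse_mom_log_line(line)
        if record is None:
            continue
        message = record["message"]
        if (
            record["kind"] == "Job"
            and "KILL_JOB" in message
            and "job does not exist locally" in message
        ):
            hits.append((record["timestamp"], record["object"]))
    return hits
```

Feeding it the open log file, e.g. find_unknown_job_kills(open(path)), where
path is wherever your mom_log lives, should list the jobs that the mom
refused to kill.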

The only way I could get this job to delete was to restart the pbs_mom
on that node.

Has anyone else encountered these symptoms? For me the first clues
were a negative "REMAINING-time" in MAUI and users complaining that
they could not qdel a job. In the past I've achieved the same effect
by removing the relevant foo.supe.JB and foo.supe.SC files from
/var/spool/torque/server_priv/jobs on the master node, but I don't
think that is the best way out. I'd appreciate any other
debug suggestions as well.
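
Before touching anything under server_priv/jobs, it may help to see which
files belong to the stuck job. The sketch below only lists them and deletes
nothing. It assumes the naming pattern mentioned above (a job prefix followed
by .JB or .SC); the function name list_job_files and the prefix argument are
hypothetical.

```python
from pathlib import Path

# Directory named in this post; adjust if your spool lives elsewhere.
JOBS_DIR = Path("/var/spool/torque/server_priv/jobs")


def list_job_files(prefix, jobs_dir=JOBS_DIR):
    """Return the sorted .JB and .SC files whose names start with prefix.

    Read-only: nothing is removed or modified.
    """
    jobs_dir = Path(jobs_dir)
    return sorted(
        entry
        for entry in jobs_dir.iterdir()
        if entry.is_file()
        and entry.name.startswith(prefix)
        and entry.suffix in (".JB", ".SC")
    )
```

Calling list_job_files("foo.supe") would then return the two files named
above if they exist.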

-- 
Rahul

