[torqueusers] can't delete a job from the queue without restarting or SIGHUP'ing all pbs_mom processes on the node where the job ran

Sabuj Pattanayek sabujp at gmail.com
Thu Apr 25 12:58:31 MDT 2013


Hi all,

Does anyone know what might cause a job that has completed or been
qdel'd to remain in the output of qstat? Here's one such job; the
mom_logs show that it was terminated:


# grep -R -i 4895385 *
mom_logs/20130425:04/25/2013 00:48:03;0001; pbs_mom;Job;TMomFinalizeJob3;job 4895385.piranha started, pid = 25069
mom_logs/20130425:04/25/2013 01:33:28;0080; pbs_mom;Job;4895385.piranha;scan_for_terminated: job 4895385.piranha task 1 terminated, sid=25069
mom_logs/20130425:04/25/2013 01:33:28;0008; pbs_mom;Job;4895385.piranha;job was terminated

On that node there are two pbs_mom processes:

root      3180  0.0  0.1  45096 28116 ?        SLsl Apr10   6:21 /usr/local/sbin/pbs_mom
root     25991  0.0  0.1  45020 25956 ?        S    01:33   0:00 /usr/local/sbin/pbs_mom
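
One way to tell the long-running daemon from a leftover copy is to list
every pbs_mom whose parent is itself a pbs_mom. The sketch below is only
an illustration: the function name find_mom_children is hypothetical, and
it assumes the 01:33 process is a forked child of the main daemon, which
the ps output above does not confirm.

```shell
# Print the PIDs of pbs_mom processes whose parent is also a pbs_mom.
# Reads "PID PPID COMMAND" lines on stdin (hypothetical helper, not part
# of TORQUE).
find_mom_children() {
    awk '
        # Remember pid -> ppid for every process named pbs_mom.
        $3 == "pbs_mom" { mom[$1] = $2 }
        END {
            for (pid in mom) {
                ppid = mom[pid]
                # A pbs_mom whose parent is another pbs_mom is a child copy.
                if (ppid in mom) print pid
            }
        }
    '
}
```

On a Linux node it could be fed with something like
`ps -eo pid=,ppid=,comm= | find_mom_children`; an empty result would mean
the extra pbs_mom was not started by the main daemon.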

If I run killall -1 pbs_mom (SIGHUP), the more recently started pbs_mom
(from 1:33 AM today) terminates and the job is then removed from
pbs_server's qstat output. I saw this:

http://www.cs.sandia.gov/cplant/doc/runtimeAdmin1_0/node82.html

...and I guess the pbs_mom is somehow not sending the obituary to
pbs_server, but why not? No processes belonging to the job's user are
still running on the node.

Thanks,
Sabuj

