[torqueusers] Torque not deleting job
Chris Samuel
csamuel at vpac.org
Sat Apr 21 00:03:44 MDT 2007
On Sat, 21 Apr 2007, Adam Emerich wrote:
Thanks for the replies to myself and Garrick. The plot thickens!
> 1. root 2015 1 0 08:54 ? 00:00:02 /usr/local/sbin/pbs_mom
> -> by default pbs_mom is not started with "-r" on our system
The pbs_mom manual page has this to say about starting pbs_mom with the -r
option after a reboot:
    If the -r option is used following a reboot, process IDs (pids) may be
    reused and MOM may kill a process that is not a batch session.
That could be a Bad Thing(tm). :-)
> 2. There is no entry in the server log for a failed epilogue or even a
> message that says the job is being terminated (note jobid is now 1160 as I
> had to recreate the issue to get more details). The first failure in the
> log is due to another process being run that was eventually preempted by
> job 1160:
Interesting - is there anything in the pbs_mom logs on the node about that job?
> 3. "qsig -s 0 1160" did not terminate the job from the server's point of
> view.
OK - now that's just plain bizarre. Signal 0 is supposed to check whether or
not the job's child process still exists. Unless you've got a process ID
getting recycled (not beyond the realms of possibility), it should declare
that process dead and clean up.
It certainly does on our RH 7.3, FC5 and SLES 9 clusters!
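For anyone unfamiliar with the signal 0 trick, here is a minimal Python sketch of the general POSIX idea: sending signal 0 delivers nothing, but the kernel still checks that the target process exists. This is only an illustration, not Torque's actual code; the function name pid_exists is made up for the example, and it assumes a POSIX system.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this pid currently exists.

    Illustrative helper (not part of Torque). Signal 0 is never
    delivered; the kernel only performs the existence and
    permission checks.
    """
    if pid <= 0:
        # pid 0 and negative values address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no such process.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True


if __name__ == "__main__":
    # Our own process certainly exists.
    print(pid_exists(os.getpid()))
```

Note that a True result only says that some process owns that pid right now. If the pid has been recycled, the probe will happily report an unrelated process as alive, which is exactly the caveat above.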
Long shot - do you have SELinux enabled? If so, can you disable it and see
if it still happens?
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia