[torqueusers] Torque not deleting job

Chris Samuel csamuel at vpac.org
Sat Apr 21 00:03:44 MDT 2007


On Sat, 21 Apr 2007, Adam Emerich wrote:

Thanks for the replies to Garrick and me. The plot thickens!

> 1.  root      2015     1  0 08:54 ?        00:00:02 /usr/local/sbin/pbs_mom
> -> by default pbs_mom is not started with "-r" on our system

The pbs_mom manual page has this to say about starting pbs_mom with the -r
option after a reboot:

                       If the  -r  option  is  used  following  a
                       reboot,  process  IDs (pids) may be reused
                       and MOM may kill a process that is  not  a
                       batch session.

That could be a Bad Thing(tm).  :-)

> 2.  There is no entry in the server log for a failed epilogue or even a
> message that says the job is being terminated (note jobid is now 1160 as I
> had to recreate the issue to get more details).  The first failure in the
> log is due to another process being run that was eventually preempted by
> job 1160:

Interesting - anything in the pbs_mom logs on the node about that job?

> 3.  "qsig -s 0 1160" did not terminate the job from the server's point of
> view.

OK - now that's just plain bizarre. Sending signal 0 is supposed to check
whether or not the job's child process still exists, and unless you've got a
process ID getting recycled (not beyond the realms of possibility) it should
then declare that process dead and clean up.

It certainly does on our RH 7.3, FC5 and SLES 9 clusters!
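
For anyone unfamiliar with the signal 0 trick, here is a minimal Python
sketch of the general POSIX idea. It is not Torque's actual code, and the
function name process_exists is made up for illustration. Signal 0 delivers
nothing; the kernel only performs its existence and permission checks.

```python
import os


def process_exists(pid: int) -> bool:
    """Probe for a process by sending signal 0 (POSIX systems only)."""
    if pid <= 0:
        # PIDs <= 0 address process groups or "all processes" in kill(2),
        # so refuse them rather than probe something unintended.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # ESRCH: no process with that PID
    except PermissionError:
        return True  # EPERM: the process exists but belongs to another user
    return True


if __name__ == "__main__":
    print(process_exists(os.getpid()))
```

Note that this only tells you that some process currently has that PID. If
the PID has been recycled, the probe succeeds even though the original
process is long gone, which is the caveat mentioned above.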

Long shot - do you have SELinux enabled? If so, can you disable it and see
if it still happens?

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia


