[torqueusers] Torque not deleting job

Adam Emerich aemerich at us.ibm.com
Mon Apr 23 07:26:24 MDT 2007


I just wanted to give some additional information on tests I ran:

1.  I restarted pbs_mom with the -r option to see if it would remedy the
problem; it did not.

2.  "qsig -s 0 1160" did nothing beyond returning an exit code of 0; the
server still thought the process was there.

3.  "qdel 1160" does clear the job from the server.


Adam Emerich
IBM Corporation - Rochester, MN
Staff Engineer
Office: 030-3 F305
Office: (507) 253-5483
Cell: (507) 358-2999
aemerich at us.ibm.com

"Insanity: doing the same thing over and over again and expecting different
results."   -Albert Einstein

From:     Chris Samuel <csamuel at vpac.org>
Sent by:  torqueusers-bounces at supercluster.org
Date:     04/21/2007 01:03
To:       torqueusers at supercluster.org
cc:
Subject:  Re: [torqueusers] Torque not deleting job

On Sat, 21 Apr 2007, Adam Emerich wrote:

Thanks for the replies to Garrick and me; the plot thickens!

> 1.  root      2015     1  0 08:54 ?        00:00:02
> -> by default pbs_mom is not started with "-r" on our system

The pbs_mom manual page says about starting a pbs_mom with the -r option

                       If the  -r  option  is  used  following  a
                       reboot,  process  IDs (pids) may be reused
                       and MOM may kill a process that is  not  a
                       batch session.

That could be a Bad Thing(tm).  :-)
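
To illustrate the pid-reuse hazard the manual page warns about, here is a
minimal Python sketch of one way to tell whether a pid still refers to the
process that was originally recorded. It assumes a Linux /proc filesystem
and compares the process start time (field 22 of /proc/<pid>/stat). The
function names are hypothetical, and this is not how pbs_mom itself tracks
sessions.

```python
from pathlib import Path
from typing import Optional


def parse_start_time(stat_line: str) -> int:
    """Extract field 22 (starttime, in clock ticks since boot) from a
    /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces, so split on the last ')' before splitting on spaces.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 22 is at index 19.
    return int(after_comm[19])


def read_start_time(pid: int) -> Optional[int]:
    """Return the start time of a live pid, or None if it cannot be read."""
    try:
        return parse_start_time(Path(f"/proc/{pid}/stat").read_text())
    except (FileNotFoundError, ProcessLookupError):
        return None


def is_same_process(pid: int, recorded_start: int) -> bool:
    """True only if the pid is alive AND started when we recorded it."""
    current = read_start_time(pid)
    return current is not None and current == recorded_start
```

A daemon that saved (pid, start time) pairs before a reboot could call
is_same_process() before signalling anything; a recycled pid would have a
different start time and be left alone.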

> 2.  There is no entry in the server log for a failed epilogue or even a
> message that says the job is being terminated (note jobid is now 1160 as
> had to recreate the issue to get more details).  The first failure in the
> log is due to another process being run that was eventually preempted by
> job 1160:

Interesting - anything in the pbs_mom logs on the node about that job ?

> 3.  "qsig -s 0 1160" did not terminate the job from the server's point of
> view.

OK - now that's just plain bizarre - that is supposed to identify whether
or not the child process exists for it, and unless you've got a process ID
getting recycled (not beyond the realms of possibility) it should
declare that process dead and clear up.

It certainly does on our RH 7.3, FC5 and SLES 9 clusters!
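
For anyone unfamiliar with signal 0, the underlying operating-system
mechanism can be sketched in Python as below. Sending signal 0 delivers
nothing; the kernel only performs its existence and permission checks.
This is an illustration of the POSIX behaviour, not TORQUE's own code, and
pid_exists is a hypothetical helper name.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this pid currently exists."""
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but is owned by someone else
    return True


if __name__ == "__main__":
    # Our own pid is always alive while this script runs.
    print(pid_exists(os.getpid()))
```

Note that a recycled pid would also make this check succeed, which is why
pid reuse is the one case where a stale job could still look alive.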

Long shot - do you have SELinux enabled?  If so, can you disable it and
see if it still happens?
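
As a rough way to check the current SELinux state on a node, the sketch
below reads the kernel's "enforce" file. It assumes selinuxfs is mounted at
/sys/fs/selinux (newer kernels) or /selinux (older ones); selinux_mode is a
hypothetical helper, and the getenforce command reports the same
information where it is installed.

```python
from pathlib import Path

# Assumed selinuxfs mount points: newer kernels first, then the older one.
DEFAULT_ENFORCE_PATHS = ("/sys/fs/selinux/enforce", "/selinux/enforce")


def selinux_mode(paths=DEFAULT_ENFORCE_PATHS) -> str:
    """Return 'enforcing', 'permissive', or 'disabled'.

    'disabled' here means no enforce file could be read, which is what you
    see when SELinux is switched off or not built into the kernel.
    """
    for candidate in paths:
        try:
            value = Path(candidate).read_text().strip()
        except OSError:
            continue  # not mounted here (or unreadable); try the next path
        return "enforcing" if value == "1" else "permissive"
    return "disabled"


if __name__ == "__main__":
    print(selinux_mode())
```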

 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

torqueusers mailing list
torqueusers at supercluster.org
