[torqueusers] Torque not deleting job

Adam Emerich aemerich at us.ibm.com
Fri Apr 20 10:14:40 MDT 2007


Answers:

1.  root      2015     1  0 08:54 ?        00:00:02 /usr/local/sbin/pbs_mom
    -> No; by default pbs_mom is not started with "-r" on our system.
2.  There is no entry in the server log for a failed epilogue, or even a
message saying that the job is being terminated.  (Note that the job ID is
now 1160, because I had to recreate the issue to get more details.)  The
first failure in the log is due to another process being run that was
eventually preempted by job 1160:

04/20/2007 08:46:16;0008;PBS_Server;Job;1160.rrmaster;Job Queued at request of aemerich at rrmaster, owner = aemerich at rrmaster, job name = STDIN, queue = dque
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;could not locate requested resources 'n01-01-06' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Run at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster

3.  "qsig -s 0 1160" did not terminate the job from the server's point of
view.
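
For anyone who wants to repeat the check in answer 2 on their own
server_log, here is a minimal Python sketch.  It assumes each record is one
line of six semicolon-separated fields, as in the excerpt above; the field
names (timestamp, event, daemon, obj_type, obj_id, message) and the helper
names are my own labels, not anything defined by TORQUE, and the search
keywords are only a guess at what an epilogue or termination message might
contain.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels inferred from the excerpt.
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_id: str
    message: str


def parse_record(line: str) -> Optional[LogRecord]:
    """Parse one server_log line; return None if it is not a full record."""
    # Split on the first five semicolons only, so that any semicolons
    # inside the free-text message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def records_for_job(lines: Iterable[str], job_id: str) -> List[LogRecord]:
    """Return the Job records whose object id equals job_id."""
    found = []
    for line in lines:
        rec = parse_record(line)
        if rec is not None and rec.obj_type == "Job" and rec.obj_id == job_id:
            found.append(rec)
    return found


def mentions(records: Iterable[LogRecord], *keywords: str) -> List[LogRecord]:
    """Return records whose message contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [r for r in records if any(k in r.message.lower() for k in lowered)]


# Two records copied from the excerpt above, plus one non-record line.
SAMPLE = [
    "04/20/2007 08:46:16;0008;PBS_Server;Job;1160.rrmaster;Job Queued at "
    "request of aemerich at rrmaster, owner = aemerich at rrmaster, "
    "job name = STDIN, queue = dque",
    "04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Run at "
    "request of root at rrmaster",
    "this line is not a log record",
]

if __name__ == "__main__":
    job_records = records_for_job(SAMPLE, "1160.rrmaster")
    for rec in mentions(job_records, "epilog", "terminat"):
        print(rec.timestamp, rec.message)
```

Run against a real log, you would pass the open file in place of SAMPLE;
an empty result from mentions() corresponds to what I saw for job 1160.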

Thanks

Adam Emerich
IBM Corporation - Rochester, MN
Staff Engineer
Office: 030-3 F305
Office: (507) 253-5483
Cell: (507) 358-2999
aemerich at us.ibm.com

"Insanity: doing the same thing over and over again and expecting different
results."   -Albert Einstein


                                                                           
Garrick Staples <garrick at clusterresources.com>
Sent by: torqueusers-bounces at supercluster.org
04/20/2007 10:13 AM

To: torqueusers at supercluster.org
cc:
Subject: Re: [torqueusers] Torque not deleting job

On Thu, Apr 19, 2007 at 11:41:59AM -0500, Adam Emerich alleged:
>
> I am seeing a case in which torque does not delete an interactive job if
> the node on which the job is running goes down.  Here is what I am doing:
>
>    qsub -I -l nodes=n01-01-06:ppn=1       -> successfully returns a prompt
>    on the machine requested
>
> Then the node (n01-01-06) is rebooted.  After the reboot, "top" on n01-01-06
> does not show any jobs being run by my userid.  However, "showq" shows the
> following on the torque server:

Is pbs_mom being started with the -r option at boot?

Can you check in server_log to see if an epilogue came and was rejected?

Does 'qsig -s 0 1131' cause the job to exit?

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



