[torqueusers] Torque not deleting job

Adam Emerich aemerich at us.ibm.com
Fri Apr 20 07:58:06 MDT 2007


Here are the answers to your questions (output is from the compute node
after the reboot):

1.  pbs_mom is set to start automatically upon reboot, and I did confirm
that it starts.
2.  [root at rrmaster ~]# checknode n01-01-06

checking node n01-01-06

State:   Running  (in current state for 00:00:00)
Configured Resources: PROCS: 8  MEM: 31G  SWAP: 31G  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: PROCS: 1
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.190
Network:    [DEFAULT]
Attributes: [Batch]
Classes:    [batch 8:8][dque 7:8][loadl 8:8]

Total Time: 13:13:36:09  Up: 13:13:23:34 (99.94%)  Active: 6:20:49:40

  Job '1160'(x1)  -00:07:01 -> 11:52:59 (12:00:00)
JobList:  1160
ALERT:  node has 1 procs dedicated but load is low (0.190)

3.  [root at rrmaster ~]# pbsnodes -l
n01-01-02            down,job-exclusive

4.  This is on a Linux cluster where both the server and the nodes are
running Fedora Core 6.
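
A node reported as both "down" and "job-exclusive" in the pbsnodes -l output above is one that is down while still holding a job allocation. As a minimal sketch, assuming the two-column "name state[,state...]" layout shown in item 3, such nodes could be picked out of the text like this. The function names are hypothetical helpers for illustration, not part of Torque, and the script only parses captured output rather than calling any Torque command:

```python
def parse_pbsnodes_list(text):
    """Map node name -> list of states, from `pbsnodes -l` style text.

    Assumes each non-empty line looks like:
        n01-01-02            down,job-exclusive
    (layout taken from the output quoted in this message).
    """
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, states = parts[0], parts[1]
        nodes[name] = states.split(",")
    return nodes


def down_but_allocated(nodes):
    """Return sorted names of nodes that are down yet still job-exclusive."""
    return sorted(
        name
        for name, states in nodes.items()
        if "down" in states and "job-exclusive" in states
    )


if __name__ == "__main__":
    sample = "n01-01-02            down,job-exclusive\n"
    print(down_but_allocated(parse_pbsnodes_list(sample)))
```

In practice the text would come from saving the output of pbsnodes -l to a file. What to do about any node the script flags is left to the administrator.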


Adam Emerich
IBM Corporation - Rochester, MN
Staff Engineer
Office: 030-3 F305
Office: (507) 253-5483
Cell: (507) 358-2999
aemerich at us.ibm.com

"Insanity: doing the same thing over and over again and expecting different
results."   -Albert Einstein

From:     Chris Samuel <csamuel at vpac.org>
Sent by:  torqueusers-bounces at supercluster.org
To:       torqueusers at supercluster.org
Date:     04/20/2007 03:56
Subject:  Re: [torqueusers] Torque not deleting job

On Fri, 20 Apr 2007, Adam Emerich wrote:

> I am seeing a case in which torque does not delete an interactive job if
> the node on which the job is running goes down.

Some (probably silly) questions:

Is someone starting pbs_mom on the node once the node is back up ?

What does checknode say ?

What does pbsnodes -l say ?

Is this on AIX by some chance ?

 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
