[torquedev] Walltime.Remaining part 2

Vikentsi Lapa vlapa at newman.bas-net.by
Mon Nov 22 07:02:50 MST 2010


I picked up the changes and tested the fixed pbs_server. Now a different problem appears when a job exceeds its walltime limit.

My job file is

#!/bin/sh
#PBS -N PBSTest
#PBS -l nodes=4:ppn=2,walltime=00:00:20

hostname
sleep 40

Output 

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 15

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 13

. . .

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 1

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 0

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 4294967294

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 4294967292

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = R
    Walltime.Remaining = 4294967291
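
These large values look like small negative numbers printed as unsigned 32-bit integers: 4294967294 is 2^32 - 2, which is what -2 becomes after wraparound. Below is a minimal sketch of that effect, assuming the remaining time is computed as the limit minus the elapsed time. I have not checked how pbs_server actually computes it, and the function names are hypothetical, not TORQUE code.

```c
#include <stdint.h>

/* Hypothetical helper, not TORQUE code: plain unsigned subtraction.
 * When elapsed > limit the result wraps modulo 2^32. */
static uint32_t remaining_naive(uint32_t limit, uint32_t elapsed)
{
    return limit - elapsed;
}

/* Hypothetical helper: report 0 once the limit has been reached,
 * so the value can never wrap. */
static uint32_t remaining_clamped(uint32_t limit, uint32_t elapsed)
{
    return (elapsed >= limit) ? 0u : limit - elapsed;
}
```

With a limit of 20 seconds and 22 seconds elapsed, remaining_naive(20, 22) gives 4294967294, the same value as in the qstat output above, while remaining_clamped(20, 22) gives 0.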


Job result

PBSTest.?1642
=>> PBS: job killed: walltime 51 exceeded limit 20
/var/spool/torque/mom_priv/jobs/1642.headnode.scc.by.SC: line 9:  2624 Terminated              sleep 120


After that I ran the job one more time and received the following result:

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = Q
    Walltime.Remaining = 17

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = Q
    Walltime.Remaining = 13

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = Q
    Walltime.Remaining = 0

$ qstat -f | grep '\(Wall\|job_state\)'
    job_state = Q
    Walltime.Remaining = 4294967294

checkjob output

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
Holds:    Defer  (hold reason:  RMFailure)
PE:  8.00  StartPriority:  1
cannot select job 1643 for partition DEFAULT (job hold active) 



On Fri, Nov 19, 2010 at 02:15:47PM -0700, David Beer wrote:
> 
> I have made several fixes to take care of walltime remaining. First, I checked in your patch. Second, it should not print if the job hasn't started (this is why it has a negative value, since the job hasn't started). Third, it should be printed as "%lu" because it is unsigned. There should be no negative time values. 
> 

