[torquedev] Walltime.Remaining part 2
Vikentsi Lapa
vlapa at newman.bas-net.by
Mon Nov 22 07:02:50 MST 2010
I found the changes and tested the fixed pbs_server. Now another problem appears when a job exceeds its walltime limit.
My job file is:
#!/bin/sh
#PBS -N PBSTest
#PBS -l nodes=4:ppn=2,walltime=00:00:20
hostname
sleep 40
Output while the job is running:
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 15
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 13
. . .
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 1
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 0
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 4294967294
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 4294967292
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = R
Walltime.Remaining = 4294967291
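The jump from 0 to 4294967294 is consistent with a negative remaining time being shown through an unsigned 32-bit value, since 4294967294 = 2^32 - 2. The following is only a sketch of that arithmetic, not code from pbs_server: the helper names wrap32 and unwrap32 are made up for illustration, the 32-bit width is an assumption based on the printed values, and the shell is assumed to do its arithmetic in 64 bits.

```sh
#!/bin/sh
# Hypothetical helpers, not part of TORQUE. Assumes the shell evaluates
# arithmetic in 64 bits (true of bash and dash on common platforms).

# Reinterpret a signed value as an unsigned 32-bit integer.
wrap32() {
    echo $(( $1 & 0xFFFFFFFF ))
}

# Reinterpret an unsigned 32-bit value as a signed one.
unwrap32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))
    else
        echo "$1"
    fi
}

wrap32 -2              # prints 4294967294
unwrap32 4294967291    # prints -5
```

Read this way, the three large values above correspond to -2, -4 and -5 seconds remaining.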
Job result:
PBSTest.?1642
=>> PBS: job killed: walltime 51 exceeded limit 20
/var/spool/torque/mom_priv/jobs/1642.headnode.scc.by.SC: line 9: 2624 Terminated sleep 120
After that I tried to run the job one more time and received the following result:
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = Q
Walltime.Remaining = 17
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = Q
Walltime.Remaining = 13
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = Q
Walltime.Remaining = 0
$ qstat -f | grep '\(Wall\|job_state\)'
job_state = Q
Walltime.Remaining = 4294967294
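Both symptoms seen so far (a countdown while the job is still in state Q, and values that have wrapped past zero) can be flagged automatically. This is a rough sketch, assuming `qstat -f` prints `job_state` before `Walltime.Remaining` for each job, as in the output above; the function name check_walltime and the messages it prints are made up for illustration.

```sh
#!/bin/sh
# Hypothetical filter, not part of TORQUE. Reads `qstat -f` output on stdin
# and classifies every Walltime.Remaining line it finds.
check_walltime() {
    awk '
        $1 == "job_state" { state = $3 }
        $1 == "Walltime.Remaining" {
            if (state != "R")
                print "remaining time reported in state " state ": " $3
            else if ($3 + 0 > 2147483647)
                print "wrapped value: " $3
            else
                print "ok: " $3
        }
    '
}

# Example with canned input instead of a live server:
printf 'job_state = Q\nWalltime.Remaining = 17\n' | check_walltime
# prints: remaining time reported in state Q: 17
```

On a live system it would be used as `qstat -f | check_walltime`.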
checkjob output:
job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
Holds: Defer (hold reason: RMFailure)
PE: 8.00 StartPriority: 1
cannot select job 1643 for partition DEFAULT (job hold active)
On Fri, Nov 19, 2010 at 02:15:47PM -0700, David Beer wrote:
>
> I have made several fixes to take care of walltime remaining. First, I checked in your patch. Second, it should not print if the job hasn't started (this is why it has a negative value, since the job hasn't started). Third, it should be printed as "%lu" because it is unsigned. There should be no negative time values.
>
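The behaviour described in the quoted message (print nothing before the job starts, never show a negative time) can be sketched as below. This is only an illustration, not the actual patch: the function name remaining and its arguments are hypothetical, a start time of 0 is assumed to mean "not started", and clamping at zero is one assumed way to guarantee that the value never goes negative.

```sh
#!/bin/sh
# Hypothetical illustration, not the actual TORQUE patch.
# Usage: remaining LIMIT START NOW   (seconds; START=0 means "not started")
remaining() {
    limit=$1 start=$2 now=$3
    # Job has not started yet: print nothing at all.
    [ "$start" -gt 0 ] || return 0
    left=$(( limit - (now - start) ))
    # Clamp at zero so the value can never go negative (and so never wraps).
    [ "$left" -ge 0 ] || left=0
    echo "$left"
}

remaining 20 1000 1005   # prints 15
remaining 20 1000 1051   # prints 0
remaining 20 0 1051      # prints nothing
```

With the numbers from the job above (limit 20, killed at walltime 51), the second call shows 0 where the unclamped value would be -31.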