[torquedev] Walltime.Remaining part 2 [patch]

Vikentsi Lapa vlapa at newman.bas-net.by
Thu Dec 2 05:09:59 MST 2010


The error is Cygwin-related. It appears because there is no root (UID=0) user in the Cygwin environment, so seteuid(0) does not work. The error appears only when one of the job's limits is exceeded. The attached patch fixes this difference.
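
The attachment itself is not reproduced in the archive, so the following is only a rough sketch of the kind of guard described above, not the contents of seteuid.patch. The helper name try_become_root is hypothetical and does not come from the TORQUE sources. It attempts seteuid(0) and reports a failure instead of assuming the switch succeeded, which is what one would expect to need on a system without a UID 0 account.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not taken from TORQUE): try to switch the
 * effective UID to 0, but treat a refusal as non-fatal.
 * Returns 0 if the effective UID is 0 afterwards, 1 otherwise. */
static int try_become_root(void)
{
    if (geteuid() == 0)
        return 0;               /* already root, nothing to do */

    if (seteuid(0) == 0)
        return 0;               /* the switch succeeded */

    /* Assumption: on Cygwin this branch is the normal case,
     * because no account with UID 0 exists there. */
    fprintf(stderr, "seteuid(0) failed: %s; continuing as uid %ld\n",
            strerror(errno), (long)geteuid());
    return 1;
}
```

A caller would check the return value and carry on with the job cleanup under the current UID when it is 1, rather than treating the failed call as fatal.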

On Mon, Nov 22, 2010 at 04:02:50PM +0200, Vikentsi Lapa wrote:
> I found the changes and tested the fixed pbs_server. Now another problem appears when a job exceeds its walltime.
> 
> My job file is
> 
> #!/bin/sh
> #PBS -N PBSTest
> #PBS -l nodes=4:ppn=2,walltime=00:00:20
> 
> hostname
> sleep 40
> 
> Job result
> 
> PBSTest.?1642
> =>> PBS: job killed: walltime 51 exceeded limit 20
> /var/spool/torque/mom_priv/jobs/1642.headnode.scc.by.SC: line 9:  2624 Terminated              sleep 120
> 
> 
> After that I tried to run the job one more time and received the following result:
> 
> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
> Holds:    Defer  (hold reason:  RMFailure)
> PE:  8.00  StartPriority:  1
> cannot select job 1643 for partition DEFAULT (job hold active) 
> 
> 
> 
> On Fri, Nov 19, 2010 at 02:15:47PM -0700, David Beer wrote:
> > 
> > I have made several fixes to take care of walltime remaining. First, I checked in your patch. Second, it should not print if the job hasn't started (this is why it has a negative value, since the job hasn't started). Third, it should be printed as "%lu" because it is unsigned. There should be no negative time values. 
> > 
> 
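
The second and third points in David's message (skip the value when the job has not started, and print it as "%lu") can be illustrated with a small sketch. This is not the code that was checked in. The function name format_walltime_remaining, its parameters, and the "Walltime.Remaining=" output format are assumptions made for the example only.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not taken from TORQUE): write the remaining
 * walltime into buf as an unsigned value.
 * limit is the walltime limit in seconds, start is the job start
 * time (0 meaning "not started"), now is the current time.
 * Returns the number of characters written, or 0 if nothing was
 * written because the job has not started yet. */
static int format_walltime_remaining(char *buf, size_t len,
                                     unsigned long limit,
                                     time_t start, time_t now)
{
    unsigned long used;
    unsigned long remaining;

    if (start == 0 || now < start) {
        if (len > 0)
            buf[0] = '\0';      /* job not started: print nothing */
        return 0;
    }

    used = (unsigned long)(now - start);

    /* Clamp at zero so an overrun never wraps around. */
    remaining = (used < limit) ? (limit - used) : 0UL;

    return snprintf(buf, len, "Walltime.Remaining=%lu", remaining);
}
```

With a limit of 20 seconds, a job that started 5 seconds ago yields 15, and a job that has already run past its limit yields 0 instead of a negative or wrapped value.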
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seteuid.patch
Type: text/x-diff
Size: 2015 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20101202/1e9c8ad1/attachment-0001.bin 

