[torquedev] Walltime.Remaining part 2 [patch]
Vikentsi Lapa
vlapa at newman.bas-net.by
Thu Dec 2 05:09:59 MST 2010
The error is Cygwin-related. It appears because the Cygwin environment has no root (UID=0) user, so seteuid(0) does not work. The error appears only if one of the limits is exceeded. The attached patch fixes these differences.
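The scrubbed seteuid.patch is not reproduced in this archive, so what follows is only a sketch of one common way to avoid depending on UID 0. It assumes the failing code path calls seteuid(0) to switch back from the job owner to the daemon's own account. Instead of hard-coding 0, it saves the daemon's effective UID at startup and restores that value. The helper names remember_daemon_euid() and restore_daemon_euid() are hypothetical and do not come from TORQUE or from the patch.

```c
#define _POSIX_C_SOURCE 200112L /* expose seteuid() under strict C modes */

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helpers, not taken from TORQUE or from seteuid.patch. */
static uid_t daemon_euid;
static int   daemon_euid_saved = 0;

/* Call once at daemon startup, while still running as the privileged account. */
void remember_daemon_euid(void)
{
  daemon_euid = geteuid();
  daemon_euid_saved = 1;
}

/* Switch back to the privileged account after temporarily acting as a job
 * owner. Uses the saved EUID instead of a hard-coded seteuid(0), because
 * the Cygwin environment has no UID 0 user.
 * Returns 0 on success, -1 on failure (errno holds the seteuid error). */
int restore_daemon_euid(void)
{
  assert(daemon_euid_saved && "remember_daemon_euid() must be called first");

  if (geteuid() == daemon_euid)
    return 0; /* already running as the daemon account, nothing to undo */

  if (seteuid(daemon_euid) != 0)
    {
    int saved_errno = errno; /* fprintf() may overwrite errno */

    fprintf(stderr, "seteuid(%lu) failed: %s\n",
            (unsigned long)daemon_euid, strerror(saved_errno));
    errno = saved_errno;
    return -1;
    }

  return 0;
}
```

On Linux this behaves like the seteuid(0) version when the daemon runs as root, since the saved EUID is then 0; on Cygwin it restores whatever privileged account started the daemon.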
On Mon, Nov 22, 2010 at 04:02:50PM +0200, Vikentsi Lapa wrote:
> I found the changes and tested the fixed pbs_server. Now another problem appears when a job exceeds its walltime.
>
> My job file is
>
> #!/bin/sh
> #PBS -N PBSTest
> #PBS -l nodes=4:ppn=2,walltime=00:00:20
>
> hostname
> sleep 40
>
> Job result
>
> PBSTest.?1642
> =>> PBS: job killed: walltime 51 exceeded limit 20
> /var/spool/torque/mom_priv/jobs/1642.headnode.scc.by.SC: line 9: 2624 Terminated sleep 120
>
>
> After that I tried to run the job one more time and received the following result:
>
> job is deferred. Reason: RMFailure (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN')
> Holds: Defer (hold reason: RMFailure)
> PE: 8.00 StartPriority: 1
> cannot select job 1643 for partition DEFAULT (job hold active)
>
>
>
> On Fri, Nov 19, 2010 at 02:15:47PM -0700, David Beer wrote:
> >
> > I have made several fixes to handle Walltime.Remaining. First, I checked in your patch. Second, it should not be printed if the job hasn't started (that is why it showed a negative value). Third, it should be printed with "%lu" because it is unsigned. There should be no negative time values.
> >
>
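David's three rules for Walltime.Remaining (print nothing before the job starts, never go negative, format as unsigned with "%lu") can be illustrated with a small C sketch. format_walltime_remaining() is a hypothetical name, not TORQUE's actual function, and the sketch assumes both the walltime limit and the walltime used are available as unsigned counts of seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not TORQUE's actual code) that formats the value of
 * Walltime.Remaining following the three rules above.
 * Returns the number of characters written; 0 means nothing was printed. */
int format_walltime_remaining(char *buf, size_t len, int job_started,
                              unsigned long limit, unsigned long used)
{
  unsigned long remaining;

  assert(buf != NULL && len > 0);
  buf[0] = '\0';

  /* Rule 1: a job that has not started has no meaningful remaining time. */
  if (!job_started)
    return 0;

  /* Rule 2: no negative values. With unsigned arithmetic, limit - used
   * would wrap around to a huge number once used > limit, so clamp at 0. */
  remaining = (used >= limit) ? 0UL : limit - used;

  /* Rule 3: the value is unsigned, so format it with %lu. */
  return snprintf(buf, len, "%lu", remaining);
}
```

With the numbers from the log above (limit 20, walltime 51), this prints "0" instead of a negative or wrapped-around value.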
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seteuid.patch
Type: text/x-diff
Size: 2015 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20101202/1e9c8ad1/attachment-0001.bin