[gold-users] gold fails to remove reservations

Scott Jackson scottmo at adaptivecomputing.com
Thu May 24 11:58:15 MDT 2012


Eva,

I'm glad you figured out a remedy to your problem.

Yes, it is definitely possible to charge more than what was reserved if
there is a grace period that allows jobs to run past their requested
wallclock limits. This can result in negative balances.
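
To put rough numbers on this, here is a sketch of the arithmetic using the
ChargeRate and job size quoted below. It assumes the job uses the full 10
minute extension and that charges are rounded to whole credits; the helper
name `charge` is made up for illustration and is not part of Gold.

```python
# Toy arithmetic only; this is not how Gold itself is implemented.
RATE = 0.00027777      # per processor-second, from the ChargeRate quoted below
PROCESSORS = 96
REQUESTED = 10 * 3600  # requested wallclock limit, in seconds
GRACE = 10 * 60        # assumed overrun allowed by the scheduler, in seconds


def charge(processors, seconds, rate=RATE):
    """Processors * rate * wall duration, rounded to whole credits (assumption)."""
    return round(processors * rate * seconds)


reserved = charge(PROCESSORS, REQUESTED)        # held when the job starts
actual = charge(PROCESSORS, REQUESTED + GRACE)  # debited if the grace period is used up
print(reserved, actual, actual - reserved)
```

If the account held only the reserved amount, the difference between the two
figures is what would show up as a negative balance.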

As for the multiple reservations: since both are there, it suggests that
both reserve calls succeeded (i.e. there were enough funds at the time to
start), and that for some reason the first job start attempt failed (perhaps
due to an RM problem).
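
The effect of leftover reservations can be sketched with a toy ledger, using
the figures from the gbalance output quoted below (1965 credits, 960 per
reservation). The `Account` class and its methods are hypothetical and only
mimic the "delete reservations of the same name first" behaviour described
further down the thread; they are not Gold's API.

```python
# Toy in-memory ledger; class and method names are hypothetical, not Gold's API.
class Account:
    def __init__(self, amount):
        self.amount = amount
        self.reservations = []  # list of (job name, held credits)

    def available(self):
        return self.amount - sum(held for _, held in self.reservations)

    def reserve(self, name, credits, replace=False):
        if replace:
            # Drop earlier holds with the same name before checking funds.
            self.reservations = [r for r in self.reservations if r[0] != name]
        if credits > self.available():
            return False
        self.reservations.append((name, credits))
        return True


def attempts(replace):
    """Three start attempts for job 964040, each reserving 960 of 1965 credits."""
    acct = Account(1965)
    results = [acct.reserve("964040", 960, replace) for _ in range(3)]
    return results, acct.available()


print(attempts(replace=False))  # third attempt is refused, 45 credits left
print(attempts(replace=True))   # every attempt succeeds, 1005 credits left
```

Without the replace step, two stale holds of 960 leave only 45 credits
available, so a third start attempt cannot reserve funds.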

Scott

On Thu, May 24, 2012 at 11:46 AM, Eva Hocks <hocks at sdsc.edu> wrote:

>
>
> Scott,
>
> thanks for your help!
>
> maui/gold charges in hours due to
> ChargeRate Modify Type==Resource Name==Processors Rate=0.00027777
>
> The checkjob -v had
> job is deferred.  Reason:  BankFailure  (cannot debit job account)
> Holds:    Defer  (hold reason:  BankFailure)
>
> We do not use any prolog/epilog with maui, just with torque.
>
> What I found out, though, is that maui added 10 minutes to the end time of
> the job due to:
> RESOURCELIMITPOLICY MEM:ALWAYS,EXTENDEDVIOLATION:NOTIFY,CANCEL:00:05:00
> RESOURCELIMITPOLICY WALLTIME:ALWAYS:NOTIFY,CANCEL:00:05:00
>
> which may have made the debit for the requested time greater than the
> credit (maui seems to round up to the next hour, while moab only did so
> for more than 30 minutes)
>
> I solved the problem by removing the reservations in gold
> goldsh Reservation Delete Id==953163
>
> and added some more credits and the job started.
>
>
> Thanks again for your help.
> Eva
>
>
>
> On Thu, 24 May 2012, Scott Jackson wrote:
>
> > Eva,
> >
> > On Wed, May 23, 2012 at 2:01 PM, Eva Hocks <hocks at sdsc.edu> wrote:
> >
> > >
> > > using maui3.2.6p21-1/gold 2.2.0.1-1: a job requesting 10 hours on 96
> > > processors doesn't start due to "cannot debit account", but it did
> > > indeed have sufficient credit.
> > >
> > > $ showq|grep 964040
> > > 964040               xyz   Deferred    96    10:00:00  Wed May 23 10:55:02
> > >
> > > job asking for 96 processors for 10 hours * 1 = 960
> > >
> >
> > First, if this is a 10 hour job, I am a little confused as to why maui is
> > not charging 96*10*3600. As far as I knew, maui charges by the second and
> > not by the hour.
> >
> > What does a checkjob -v on the job say is the reason for the deferral?
> >
> >
> > > $ gbalance -u fpaolo
> > > Id  Name   Amount Reserved Balance CreditLimit Available
> > > --- ------ ------ -------- ------- ----------- ---------
> > > 921 xyz    1965     1920      45           0        45
> > >
> > >
> > >
> > > Looking at the gold database, why does one job (964040) have 2
> > > reservations?
> > >
> > > $ goldsh Reservation Query Name==964040
> > > Id     Name   Job     User   Project Machine StartTime           EndTime             Description
> > > ------ ------ ------- ------ ------- ------- ------------------- ------------------- -----------
> > > 953044 964040 1011912 fpaolo fpaolo  Triton  2012-05-23 10:55:03 2012-05-23 21:05:03
> > > 953239 964040 1012109 fpaolo fpaolo  Triton  2012-05-23 12:36:33 2012-05-23 22:46:33
> > >
> > >
> > I am not sure, but the problem would be in maui for leaving the
> > reservations around (or in the prolog and epilog scripts, since I don't
> > see how maui is going to be charging 960 for the job). One improvement
> > that we made in Moab was in linking the jobs up better. When Maui was
> > written, if a job tried to start multiple times, it could easily wind up
> > creating multiple reservations. Later we added some flags to Gold to
> > better address this situation (Moab calls Job Reserve with the
> > Replace:=True flag, which deletes any reservations of the same name before
> > creating a new reservation). I suspect that either this is the problem (and
> > you are using an edited version of maui for the hours thing) or that you
> > are using prologs and epilogs which do not call the Replace option.
> >
> > Scott
> >
> >
> > >
> > > Id      Object Action  Actor Name Child JobId  Amount Delta Account
> > > Project User   Machine Allocation Count Description Details
> > > CreationTime ModificationTime Deleted RequestId TransactionId
> > > ------------------------------------------------------------------
> > > 9462919 Job    Reserve maui             964040    960 fpaolo fpaolo Triton             1
> > >   WallDuration=36000,Processors=96,QualityOfService=DEFAULT,Queue=batch,Stage=Reserve,Charge=0,CallType=Normal,
> > >   ItemizedCharges:=( ( ( 96 [Processors] * 0.00027777 [ChargeRate{VBR}{Processors}] ) ) * 36000 [WallDuration] ) = 960
> > >   2012-05-23 10:55:06 2012-05-23 10:55:06 False 2901179   9462919
> > > 9465533 Job    Reserve maui             964040    960 fpaolo fpaolo Triton             1
> > >   WallDuration=36000,Processors=96,QualityOfService=DEFAULT,Queue=batch,Stage=Reserve,Charge=0,CallType=Normal,
> > >   ItemizedCharges:=( ( ( 96 [Processors] * 0.00027777 [ChargeRate{VBR}{Processors}] ) ) * 36000 [WallDuration] ) = 960
> > >   2012-05-23 12:36:44 2012-05-23 12:36:44 False 2901826   9465533
> > >
> > > It does not delete those correctly, thus any further try at starting
> > > fails as well. Any idea why the reservation request is not removed? It
> > > does not happen with other jobs.
> > >
> > > Thanks
> > > Eva
> > >
> > > _______________________________________________
> > > gold-users mailing list
> > > gold-users at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/gold-users
> > >
> >
>
>