[gold-users] are stale reservations normal

Scott Jackson scottmo at adaptivecomputing.com
Thu Mar 18 11:51:36 MDT 2010


Brock,

So it appears you are right in your evaluation. It appears that Moab is 
simply not sending the charge request to Gold (assuming there is no 
change in LOGLEVEL between the two moab.log excerpts you gave me). 
Indeed the CHARGEPOLICY=DEBITALLWC should be charging for all jobs. This 
appears to be a problem in moab. I would recommend you submit a ticket 
to moab-support and provide the relevant evidence collected so far 
(especially the two cases where MAMAllocJDebit results in a call to Gold 
whereas the other does not).

Thanks,

Scott


Brock Palen wrote:
> gold at cac-admin01$ glsjob -J 3282258
> Id      JobId   User    Project Machine Queue QualityOfService Stage   
> Charge Processors Nodes WallDuration StartTime EndTime Description
> ------- ------- ------- ------- ------- ----- ---------------- ------- 
> ------ ---------- ----- ------------ --------- ------- -----------
> 4427406 3282258 aducoin ylyoung nyx           ylyoung          
> Reserve      0 4                86400
>
>
> AMCFG[bank] SERVER=gold://cac-admin01.engin.umich.edu 
> JOBFAILUREACTION=IGNORE CHARGEPOLICY=DEBITALLWC TIMEOUT=30
>
>
> I am not so sure about the Moab log, your asking for, under normal 
> conditions our moab.log has lines like this:
>
> 03/18 10:23:35  MAMAllocJDebit(A,3287524,SC,EMsg)
> 03/18 10:23:35  MS3DoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
> 03/18 10:23:35  MSUSendData(S,30000000,FALSE,FALSE,SC,NULL)
> 03/18 10:23:35  INFO:     packet sent (696 bytes of 696)
> 03/18 10:23:35  INFO:     command sent to server
> 03/18 10:23:35  INFO:     message sent: '<XML>'
> 03/18 10:23:35  MSURecvData(,30000000,FALSE,SC,EMsg)
> 03/18 10:23:35  MSURecvPacket(13,BufP,1024,^M
> ^M
> ,30000000,SC)
> 03/18 10:23:36  MSURecvPacket(13,BufP,1024,^M
> ,30000000,SC)
> 03/18 10:23:36  MSURecvPacket(13,BufP,341,NULL,30000000,SC)
> 03/18 10:23:36  INFO:     response received from server
> 03/18 10:23:36  INFO:     response received: '<?xml version="1.0" 
> encoding="UTF-8"?>
> <Envelope><Body><Response 
> actor="gold"><Status><Value>Success</Value><Code>000</Code><Message>Successfully 
> charged job 3287524 for 10 credits
> 1 reservations were 
> removed</Message></Status><Count>10</Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></Data></Response></Body></Envelope> 
>
> '
> 03/18 10:23:36  MSUDisconnect(13)
> 03/18 10:23:36  INFO:     command response '<?xml version="1.0" 
> encoding="UTF-8"?>
> <Envelope><Body><Response 
> actor="gold"><Status><Value>Success</Value><Code>000</Code><Message>Successfully 
> charged job 3287524 for 10 credits
> 1 reservations were 
> removed</Message></Status><Count>10</Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></Data></Response></Body></Envelope> 
>
>
>
> Note that the log for the job that does not get charged, Goe from 
> MAMAllocJDebit() to MJobSendFB and I get the alerts invlalid system 
> queue time, and I never see XML lines, or any data showing up in 
> goldd.log
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Mar 18, 2010, at 1:21 PM, Scott Jackson wrote:
>
>> Brock,
>>
>> It appears to me that Moab is trying to charge -- at least that is 
>> what I think MAMAllocJDebit should be doing. You would need to bump 
>> moab's LOGLEVEL up to about 7 to see the details of what it might be 
>> sending to Gold here. Please increase the loglevel and give me an 
>> excerpt.
>>
>> Also, I would be interested to see what the related glsjob shows for 
>> this job (Reserve or Charge). This should indicate the last job 
>> action taken against this job was:
>>
>> glsjob -J 3282258
>>
>> Also, please send your moab.cfg. I want to see what your CHARGEPOLICY 
>> is. If you have something like DEBITSUCCESSFULWC, then it will only 
>> debit for jobs it deems to be successful. If Moab is marking this as 
>> a failed job, it may not be charging. I would generally recommend 
>> using the ALL variants of CHARGEPOLICY (like DEBITALLWC). Excuse me 
>> if I misspelled some of these policy names, I did not look them up:)
>>
>> Thanks,
>>
>> Scott
>>
>>
>> Brock Palen wrote:
>>> It looks like moab is getting in a odd state and not issuing the 
>>> charge,
>>>
>>> glsres -I
>>> 4380349 3282258   345600 2010-03-17 11:39:13 2010-03-18 11:49:13 
>>> 4427406 aducoin  ylyoung nyx     2
>>>
>>> In the moab logs I only see this Alert over and over:
>>>
>>> 03/17 11:44:59  ALERT:    job '3282258' has been in state 'Running' 
>>> for 306 seconds.  node 'nyx0900' is in state 'Running'  (job 
>>> '3282258' will be cancelled
>>> )
>>> 03/17 11:44:59  MSysRegEvent(JOBCORRUPTION:  job '3282258' (user 
>>> aducoin) has been in state 'Running' for 306 seconds.  node 
>>> 'nyx0900' is in state 'Running'
>>>  (job '3282258' will be cancelled)
>>>
>>> 03/17 15:01:23  MJobProcessCompleted(3282258)
>>> 03/17 15:01:23  MJobProcessTVariables(3282258)
>>> 03/17 15:01:23  MAMAllocJDebit(A,3282258,SC,EMsg)
>>> 03/17 15:01:23  MJobSendFB(3282258)
>>> 03/17 15:01:23  MSysLaunchAction(ASList,)
>>> 03/17 15:01:23  INFO:     job usage sent for job '3282258'
>>> 03/17 15:01:23  ALERT:    job '           3282258' has invalid 
>>> system queue time (SQ: 1268852118 > ST: 1268840355)
>>> 03/17 15:01:23  INFO:     job '           3282258' completed.  
>>> QueueTime:      0  RunTime:  11960  Accuracy: 13.84  XFactor:  0.14
>>> 03/17 15:01:23  INFO:     overall statistics.  Accuracy:   nan  
>>> XFactor:   inf
>>> 03/17 15:01:23  INFO:     job '3282258' completed  X: 0.138426  T: 
>>> 11960  PS: 47840  A: 0.138426 (RM: nyx/nyx)
>>> 03/17 15:01:23  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>>> 03/17 15:01:23  INFO:     added completed job '3282258', Job 
>>> Completion Time Wed Mar 17 14:58:35
>>>
>>> 03/17 15:01:23  INFO:     node 'nyx0900' released from job 3282258
>>> 03/17 15:01:23  MJobRemove(3282258)
>>> 03/17 15:01:23  MJobDestroyVM(3282258,EMsg)
>>> 03/17 15:01:23  MRsvDestroy(3282258,TRUE,TRUE)
>>> 03/17 15:01:23  MRsvDestroyCredLock(3282258)
>>> 03/17 15:01:23  MJobDestroy(3282258)
>>>
>>> 03/17 15:06:07  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>>> 03/17 15:06:07  INFO:     added completed job '3282258', Job 
>>> Completion Time Wed Mar 17 14:58:35
>>> 03/17 15:06:07  MJobDestroy(3282258)
>>>
>>>
>>> We run thousands a job a day so most jobs are not showing this 
>>> behavior and get charged.
>>>
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>>
>>> On Mar 17, 2010, at 2:11 PM, Scott Jackson wrote:
>>>
>>>> Brock,
>>>>
>>>> I might expect a few here and there, but on this scale I would say 
>>>> there is something pretty wrong.
>>>>
>>>> I would recommend using glsres -I to get a list of ones that have 
>>>> expired but were not removed. Then look for these in the goldd.log 
>>>> to see if Charges were issued for them. You may find that Errors 
>>>> occurred, or you may find that Moab never sent the charge request, 
>>>> or you may find that there is a bug in Gold where it is charging 
>>>> but the reservation is not getting removed (naturally, this is 
>>>> doubtful:).
>>>>
>>>> Scott
>>>>
>>>>
>>>> Brock Palen wrote:
>>>>> We tend to accumulate stale reservations (things that get deleted 
>>>>> with  grmres -I)
>>>>>
>>>>> We have setup a cron job to run grmres -I every night and deletes  
>>>>> between 100 and 500 every day.  Should this be happening?  What 
>>>>> would  be causing this?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> Center for Advanced Computing
>>>>> brockp at umich.edu
>>>>> (734)936-1985
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> gold-users mailing list
>>>>> gold-users at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/gold-users
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



More information about the gold-users mailing list