[gold-users] are stale reservations normal

Brock Palen brockp at umich.edu
Thu Mar 18 11:38:21 MDT 2010


gold at cac-admin01$ glsjob -J 3282258
Id      JobId   User    Project Machine Queue QualityOfService Stage    
Charge Processors Nodes WallDuration StartTime EndTime Description
------- ------- ------- ------- ------- ----- ---------------- -------  
------ ---------- ----- ------------ --------- ------- -----------
4427406 3282258 aducoin ylyoung nyx           ylyoung           
Reserve      0 4                86400


AMCFG[bank] SERVER=gold://cac-admin01.engin.umich.edu  
JOBFAILUREACTION=IGNORE CHARGEPOLICY=DEBITALLWC TIMEOUT=30


I am not so sure about the Moab log, your asking for, under normal  
conditions our moab.log has lines like this:

03/18 10:23:35  MAMAllocJDebit(A,3287524,SC,EMsg)
03/18 10:23:35  MS3DoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
03/18 10:23:35  MSUSendData(S,30000000,FALSE,FALSE,SC,NULL)
03/18 10:23:35  INFO:     packet sent (696 bytes of 696)
03/18 10:23:35  INFO:     command sent to server
03/18 10:23:35  INFO:     message sent: '<XML>'
03/18 10:23:35  MSURecvData(,30000000,FALSE,SC,EMsg)
03/18 10:23:35  MSURecvPacket(13,BufP,1024,^M
^M
,30000000,SC)
03/18 10:23:36  MSURecvPacket(13,BufP,1024,^M
,30000000,SC)
03/18 10:23:36  MSURecvPacket(13,BufP,341,NULL,30000000,SC)
03/18 10:23:36  INFO:     response received from server
03/18 10:23:36  INFO:     response received: '<?xml version="1.0"  
encoding="UTF-8"?>
<Envelope><Body><Response actor="gold"><Status><Value>Success</ 
Value><Code>000</Code><Message>Successfully charged job 3287524 for 10  
credits
1 reservations were removed</Message></Status><Count>10</ 
Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></ 
Data></Response></Body></Envelope>
'
03/18 10:23:36  MSUDisconnect(13)
03/18 10:23:36  INFO:     command response '<?xml version="1.0"  
encoding="UTF-8"?>
<Envelope><Body><Response actor="gold"><Status><Value>Success</ 
Value><Code>000</Code><Message>Successfully charged job 3287524 for 10  
credits
1 reservations were removed</Message></Status><Count>10</ 
Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></ 
Data></Response></Body></Envelope>


Note that the log for the job that does not get charged, Goe from  
MAMAllocJDebit() to MJobSendFB and I get the alerts invlalid system  
queue time, and I never see XML lines, or any data showing up in  
goldd.log


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Mar 18, 2010, at 1:21 PM, Scott Jackson wrote:

> Brock,
>
> It appears to me that Moab is trying to charge -- at least that is  
> what I think MAMAllocJDebit should be doing. You would need to bump  
> moab's LOGLEVEL up to about 7 to see the details of what it might be  
> sending to Gold here. Please increase the loglevel and give me an  
> excerpt.
>
> Also, I would be interested to see what the related glsjob shows for  
> this job (Reserve or Charge). This should indicate the last job  
> action taken against this job was:
>
> glsjob -J 3282258
>
> Also, please send your moab.cfg. I want to see what your  
> CHARGEPOLICY is. If you have something like DEBITSUCCESSFULWC, then  
> it will only debit for jobs it deems to be successful. If Moab is  
> marking this as a failed job, it may not be charging. I would  
> generally recommend using the ALL variants of CHARGEPOLICY (like  
> DEBITALLWC). Excuse me if I misspelled some of these policy names, I  
> did not look them up:)
>
> Thanks,
>
> Scott
>
>
> Brock Palen wrote:
>> It looks like moab is getting in a odd state and not issuing the  
>> charge,
>>
>> glsres -I
>> 4380349 3282258   345600 2010-03-17 11:39:13 2010-03-18 11:49:13  
>> 4427406 aducoin  ylyoung nyx     2
>>
>> In the moab logs I only see this Alert over and over:
>>
>> 03/17 11:44:59  ALERT:    job '3282258' has been in state 'Running'  
>> for 306 seconds.  node 'nyx0900' is in state 'Running'  (job  
>> '3282258' will be cancelled
>> )
>> 03/17 11:44:59  MSysRegEvent(JOBCORRUPTION:  job '3282258' (user  
>> aducoin) has been in state 'Running' for 306 seconds.  node  
>> 'nyx0900' is in state 'Running'
>>  (job '3282258' will be cancelled)
>>
>> 03/17 15:01:23  MJobProcessCompleted(3282258)
>> 03/17 15:01:23  MJobProcessTVariables(3282258)
>> 03/17 15:01:23  MAMAllocJDebit(A,3282258,SC,EMsg)
>> 03/17 15:01:23  MJobSendFB(3282258)
>> 03/17 15:01:23  MSysLaunchAction(ASList,)
>> 03/17 15:01:23  INFO:     job usage sent for job '3282258'
>> 03/17 15:01:23  ALERT:    job '           3282258' has invalid  
>> system queue time (SQ: 1268852118 > ST: 1268840355)
>> 03/17 15:01:23  INFO:     job '           3282258' completed.   
>> QueueTime:      0  RunTime:  11960  Accuracy: 13.84  XFactor:  0.14
>> 03/17 15:01:23  INFO:     overall statistics.  Accuracy:   nan   
>> XFactor:   inf
>> 03/17 15:01:23  INFO:     job '3282258' completed  X: 0.138426  T:  
>> 11960  PS: 47840  A: 0.138426 (RM: nyx/nyx)
>> 03/17 15:01:23  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>> 03/17 15:01:23  INFO:     added completed job '3282258', Job  
>> Completion Time Wed Mar 17 14:58:35
>>
>> 03/17 15:01:23  INFO:     node 'nyx0900' released from job 3282258
>> 03/17 15:01:23  MJobRemove(3282258)
>> 03/17 15:01:23  MJobDestroyVM(3282258,EMsg)
>> 03/17 15:01:23  MRsvDestroy(3282258,TRUE,TRUE)
>> 03/17 15:01:23  MRsvDestroyCredLock(3282258)
>> 03/17 15:01:23  MJobDestroy(3282258)
>>
>> 03/17 15:06:07  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>> 03/17 15:06:07  INFO:     added completed job '3282258', Job  
>> Completion Time Wed Mar 17 14:58:35
>> 03/17 15:06:07  MJobDestroy(3282258)
>>
>>
>> We run thousands a job a day so most jobs are not showing this  
>> behavior and get charged.
>>
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>>
>> On Mar 17, 2010, at 2:11 PM, Scott Jackson wrote:
>>
>>> Brock,
>>>
>>> I might expect a few here and there, but on this scale I would say  
>>> there is something pretty wrong.
>>>
>>> I would recommend using glsres -I to get a list of ones that have  
>>> expired but were not removed. Then look for these in the goldd.log  
>>> to see if Charges were issued for them. You may find that Errors  
>>> occurred, or you may find that Moab never sent the charge request,  
>>> or you may find that there is a bug in Gold where it is charging  
>>> but the reservation is not getting removed (naturally, this is  
>>> doubtful:).
>>>
>>> Scott
>>>
>>>
>>> Brock Palen wrote:
>>>> We tend to accumulate stale reservations (things that get deleted  
>>>> with  grmres -I)
>>>>
>>>> We have setup a cron job to run grmres -I every night and  
>>>> deletes  between 100 and 500 every day.  Should this be  
>>>> happening?  What would  be causing this?
>>>>
>>>> Thanks
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> gold-users mailing list
>>>> gold-users at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/gold-users
>>>>
>>>
>>>
>>>
>>
>
>
>



More information about the gold-users mailing list