[gold-users] are stale reservations normal

Scott Jackson scottmo at adaptivecomputing.com
Thu Mar 18 11:21:09 MDT 2010


Brock,

It appears to me that Moab is trying to charge -- at least that is what 
I think MAMAllocJDebit should be doing. You would need to bump moab's 
LOGLEVEL up to about 7 to see the details of what it might be sending to 
Gold here. Please increase the loglevel and give me an excerpt.

Also, I would be interested to see what the related glsjob shows for 
this job (Reserve or Charge). This should indicate the last job action 
taken against this job was:

glsjob -J 3282258

Also, please send your moab.cfg. I want to see what your CHARGEPOLICY 
is. If you have something like DEBITSUCCESSFULWC, then it will only 
debit for jobs it deems to be successful. If Moab is marking this as a 
failed job, it may not be charging. I would generally recommend using 
the ALL variants of CHARGEPOLICY (like DEBITALLWC). Excuse me if I 
misspelled some of these policy names, I did not look them up:)

Thanks,

Scott


Brock Palen wrote:
> It looks like moab is getting in a odd state and not issuing the charge,
>
> glsres -I
> 4380349 3282258   345600 2010-03-17 11:39:13 2010-03-18 11:49:13 
> 4427406 aducoin  ylyoung nyx     2
>
> In the moab logs I only see this Alert over and over:
>
> 03/17 11:44:59  ALERT:    job '3282258' has been in state 'Running' 
> for 306 seconds.  node 'nyx0900' is in state 'Running'  (job '3282258' 
> will be cancelled
> )
> 03/17 11:44:59  MSysRegEvent(JOBCORRUPTION:  job '3282258' (user 
> aducoin) has been in state 'Running' for 306 seconds.  node 'nyx0900' 
> is in state 'Running'
>   (job '3282258' will be cancelled)
>
> 03/17 15:01:23  MJobProcessCompleted(3282258)
> 03/17 15:01:23  MJobProcessTVariables(3282258)
> 03/17 15:01:23  MAMAllocJDebit(A,3282258,SC,EMsg)
> 03/17 15:01:23  MJobSendFB(3282258)
> 03/17 15:01:23  MSysLaunchAction(ASList,)
> 03/17 15:01:23  INFO:     job usage sent for job '3282258'
> 03/17 15:01:23  ALERT:    job '           3282258' has invalid system 
> queue time (SQ: 1268852118 > ST: 1268840355)
> 03/17 15:01:23  INFO:     job '           3282258' completed.  
> QueueTime:      0  RunTime:  11960  Accuracy: 13.84  XFactor:  0.14
> 03/17 15:01:23  INFO:     overall statistics.  Accuracy:   nan  
> XFactor:   inf
> 03/17 15:01:23  INFO:     job '3282258' completed  X: 0.138426  T: 
> 11960  PS: 47840  A: 0.138426 (RM: nyx/nyx)
> 03/17 15:01:23  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
> 03/17 15:01:23  INFO:     added completed job '3282258', Job 
> Completion Time Wed Mar 17 14:58:35
>
> 03/17 15:01:23  INFO:     node 'nyx0900' released from job 3282258
> 03/17 15:01:23  MJobRemove(3282258)
> 03/17 15:01:23  MJobDestroyVM(3282258,EMsg)
> 03/17 15:01:23  MRsvDestroy(3282258,TRUE,TRUE)
> 03/17 15:01:23  MRsvDestroyCredLock(3282258)
> 03/17 15:01:23  MJobDestroy(3282258)
>
> 03/17 15:06:07  MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
> 03/17 15:06:07  INFO:     added completed job '3282258', Job 
> Completion Time Wed Mar 17 14:58:35
> 03/17 15:06:07  MJobDestroy(3282258)
>
>
> We run thousands a job a day so most jobs are not showing this 
> behavior and get charged.
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Mar 17, 2010, at 2:11 PM, Scott Jackson wrote:
>
>> Brock,
>>
>> I might expect a few here and there, but on this scale I would say 
>> there is something pretty wrong.
>>
>> I would recommend using glsres -I to get a list of ones that have 
>> expired but were not removed. Then look for these in the goldd.log to 
>> see if Charges were issued for them. You may find that Errors 
>> occurred, or you may find that Moab never sent the charge request, or 
>> you may find that there is a bug in Gold where it is charging but the 
>> reservation is not getting removed (naturally, this is doubtful:).
>>
>> Scott
>>
>>
>> Brock Palen wrote:
>>> We tend to accumulate stale reservations (things that get deleted 
>>> with  grmres -I)
>>>
>>> We have setup a cron job to run grmres -I every night and deletes  
>>> between 100 and 500 every day.  Should this be happening?  What 
>>> would  be causing this?
>>>
>>> Thanks
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>>
>>> _______________________________________________
>>> gold-users mailing list
>>> gold-users at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/gold-users
>>>
>>
>>
>>
>



More information about the gold-users mailing list