[gold-users] are stale reservations normal
scottmo at adaptivecomputing.com
Thu Mar 18 11:21:09 MDT 2010
It appears to me that Moab is trying to charge -- at least that is what
I think MAMAllocJDebit should be doing. You would need to bump moab's
LOGLEVEL up to about 7 to see the details of what it might be sending to
Gold here. Please increase the loglevel and give me an excerpt.
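Once the log level is raised, a short filter can pull out only the lines that mention one job, which keeps the excerpt manageable. This is a minimal sketch, assuming the Moab log is plain text with one entry per line. The function name is hypothetical and not part of Moab.

```python
def job_excerpt(lines, job_id):
    """Return the log lines that mention job_id (plain substring match)."""
    needle = str(job_id)
    return [line.rstrip("\n") for line in lines if needle in line]


# Sample lines taken from the log quoted in this thread.
sample = [
    "03/17 15:01:23 MJobProcessCompleted(3282258)\n",
    "03/17 15:01:23 MAMAllocJDebit(A,3282258,SC,EMsg)\n",
    "03/17 15:01:23 MSysLaunchAction(ASList,)\n",
]
for line in job_excerpt(sample, 3282258):
    print(line)
```

An open file object can be passed in place of the sample list. Because this is a substring match, it will also pick up longer job IDs that happen to contain the same digits.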
Also, I would be interested to see what glsjob shows for this job
(Reserve or Charge). It should indicate what the last action taken
against this job was:
glsjob -J 3282258
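If there are many stale reservations, the same check can be prepared for each of them. This is a rough sketch that turns glsres -I output into a list of glsjob commands. It assumes each reservation line begins with the reservation id followed by the job id, as the sample quoted in this thread suggests. The function name is hypothetical, and the commands are only printed, not run.

```python
def glsjob_commands(glsres_lines):
    """Build one 'glsjob -J <job>' command per reservation line.

    Assumption: each line starts with the reservation id followed by the
    job id, as in the glsres -I sample quoted in this thread.
    """
    commands = []
    seen = set()
    for line in glsres_lines:
        fields = line.split()
        # Skip blank lines, headers, and anything without two numeric ids.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        job_id = fields[1]
        if job_id not in seen:
            seen.add(job_id)
            commands.append("glsjob -J " + job_id)
    return commands


sample = ["4380349 3282258 345600 2010-03-17 11:39:13 2010-03-18 11:49:13"]
print("\n".join(glsjob_commands(sample)))
```

Lines that do not start with two numeric fields, such as headers or wrapped continuation lines, are ignored rather than guessed at.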
Also, please send your moab.cfg. I want to see what your CHARGEPOLICY
is. If you have something like DEBITSUCCESSFULWC, then it will only
debit for jobs it deems to be successful. If Moab is marking this as a
failed job, it may not be charging. I would generally recommend using
the ALL variants of CHARGEPOLICY (like DEBITALLWC). Excuse me if I
misspelled some of these policy names; I did not look them up :)
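To check which policy is in effect before sending the file, a short script can scan moab.cfg for CHARGEPOLICY and flag the SUCCESSFUL variants. This is a sketch under assumptions: the `AMCFG[bank] CHARGEPOLICY=...` line in the sample is illustrative syntax, the policy names are the ones mentioned above and have not been verified, and the function name is hypothetical.

```python
import re

POLICY_RE = re.compile(r"CHARGEPOLICY\s*=\s*(\w+)", re.IGNORECASE)


def charge_policies(cfg_lines):
    """Return (policy, only_successful) for each CHARGEPOLICY setting found."""
    found = []
    for line in cfg_lines:
        text = line.split("#", 1)[0]  # drop comments
        match = POLICY_RE.search(text)
        if match:
            policy = match.group(1).upper()
            found.append((policy, "SUCCESSFUL" in policy))
    return found


# Assumed syntax, for illustration only; check it against your own moab.cfg.
sample = ["AMCFG[bank] CHARGEPOLICY=DEBITSUCCESSFULWC  # hypothetical line\n"]
for policy, only_successful in charge_policies(sample):
    note = "only successful jobs are debited" if only_successful else "not a SUCCESSFUL variant"
    print(policy, "->", note)
```

A policy flagged as a SUCCESSFUL variant is a candidate explanation for jobs that Moab marks as failed and then does not charge.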
Brock Palen wrote:
> It looks like Moab is getting into an odd state and not issuing the charge.
> glsres -I
> 4380349 3282258 345600 2010-03-17 11:39:13 2010-03-18 11:49:13
> 4427406 aducoin ylyoung nyx 2
> In the moab logs I only see this Alert over and over:
> 03/17 11:44:59 ALERT: job '3282258' has been in state 'Running'
> for 306 seconds. node 'nyx0900' is in state 'Running' (job '3282258'
> will be cancelled
> 03/17 11:44:59 MSysRegEvent(JOBCORRUPTION: job '3282258' (user
> aducoin) has been in state 'Running' for 306 seconds. node 'nyx0900'
> is in state 'Running'
> (job '3282258' will be cancelled)
> 03/17 15:01:23 MJobProcessCompleted(3282258)
> 03/17 15:01:23 MJobProcessTVariables(3282258)
> 03/17 15:01:23 MAMAllocJDebit(A,3282258,SC,EMsg)
> 03/17 15:01:23 MJobSendFB(3282258)
> 03/17 15:01:23 MSysLaunchAction(ASList,)
> 03/17 15:01:23 INFO: job usage sent for job '3282258'
> 03/17 15:01:23 ALERT: job ' 3282258' has invalid system
> queue time (SQ: 1268852118 > ST: 1268840355)
> 03/17 15:01:23 INFO: job ' 3282258' completed.
> QueueTime: 0 RunTime: 11960 Accuracy: 13.84 XFactor: 0.14
> 03/17 15:01:23 INFO: overall statistics. Accuracy: nan
> XFactor: inf
> 03/17 15:01:23 INFO: job '3282258' completed X: 0.138426 T:
> 11960 PS: 47840 A: 0.138426 (RM: nyx/nyx)
> 03/17 15:01:23 MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
> 03/17 15:01:23 INFO: added completed job '3282258', Job
> Completion Time Wed Mar 17 14:58:35
> 03/17 15:01:23 INFO: node 'nyx0900' released from job 3282258
> 03/17 15:01:23 MJobRemove(3282258)
> 03/17 15:01:23 MJobDestroyVM(3282258,EMsg)
> 03/17 15:01:23 MRsvDestroy(3282258,TRUE,TRUE)
> 03/17 15:01:23 MRsvDestroyCredLock(3282258)
> 03/17 15:01:23 MJobDestroy(3282258)
> 03/17 15:06:07 MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
> 03/17 15:06:07 INFO: added completed job '3282258', Job
> Completion Time Wed Mar 17 14:58:35
> 03/17 15:06:07 MJobDestroy(3282258)
> We run thousands of jobs a day, so most jobs are not showing this
> behavior and do get charged.
> Brock Palen
> Center for Advanced Computing
> brockp at umich.edu
> On Mar 17, 2010, at 2:11 PM, Scott Jackson wrote:
>> I might expect a few here and there, but on this scale I would say
>> there is something pretty wrong.
>> I would recommend using glsres -I to get a list of ones that have
>> expired but were not removed. Then look for these in the goldd.log to
>> see if Charges were issued for them. You may find that Errors
>> occurred, or you may find that Moab never sent the charge request, or
>> you may find that there is a bug in Gold where it is charging but the
>> reservation is not getting removed (naturally, this is doubtful:).
>> Brock Palen wrote:
>>> We tend to accumulate stale reservations (things that get deleted
>>> with grmres -I).
>>> We have set up a cron job to run grmres -I every night, and it
>>> deletes between 100 and 500 every day. Should this be happening?
>>> What would be causing this?
>>> Brock Palen
>>> Center for Advanced Computing
>>> brockp at umich.edu