[gold-users] are stale reservations normal
Brock Palen
brockp at umich.edu
Thu Mar 18 11:38:21 MDT 2010
gold@cac-admin01$ glsjob -J 3282258
Id JobId User Project Machine Queue QualityOfService Stage
Charge Processors Nodes WallDuration StartTime EndTime Description
------- ------- ------- ------- ------- ----- ---------------- -------
------ ---------- ----- ------------ --------- ------- -----------
4427406 3282258 aducoin ylyoung nyx ylyoung
Reserve 0 4 86400
AMCFG[bank] SERVER=gold://cac-admin01.engin.umich.edu
JOBFAILUREACTION=IGNORE CHARGEPOLICY=DEBITALLWC TIMEOUT=30
I am not so sure about the Moab log you're asking for. Under normal
conditions our moab.log has lines like this:
03/18 10:23:35 MAMAllocJDebit(A,3287524,SC,EMsg)
03/18 10:23:35 MS3DoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
03/18 10:23:35 MSUSendData(S,30000000,FALSE,FALSE,SC,NULL)
03/18 10:23:35 INFO: packet sent (696 bytes of 696)
03/18 10:23:35 INFO: command sent to server
03/18 10:23:35 INFO: message sent: '<XML>'
03/18 10:23:35 MSURecvData(,30000000,FALSE,SC,EMsg)
03/18 10:23:35 MSURecvPacket(13,BufP,1024,^M
^M
,30000000,SC)
03/18 10:23:36 MSURecvPacket(13,BufP,1024,^M
,30000000,SC)
03/18 10:23:36 MSURecvPacket(13,BufP,341,NULL,30000000,SC)
03/18 10:23:36 INFO: response received from server
03/18 10:23:36 INFO: response received: '<?xml version="1.0"
encoding="UTF-8"?>
<Envelope><Body><Response actor="gold"><Status><Value>Success</
Value><Code>000</Code><Message>Successfully charged job 3287524 for 10
credits
1 reservations were removed</Message></Status><Count>10</
Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></
Data></Response></Body></Envelope>
'
03/18 10:23:36 MSUDisconnect(13)
03/18 10:23:36 INFO: command response '<?xml version="1.0"
encoding="UTF-8"?>
<Envelope><Body><Response actor="gold"><Status><Value>Success</
Value><Code>000</Code><Message>Successfully charged job 3287524 for 10
credits
1 reservations were removed</Message></Status><Count>10</
Count><Data><Charge><Amount>10</Amount><Job>4457902</Job></Charge></
Data></Response></Body></Envelope>
Note that the log for the job that does not get charged goes from
MAMAllocJDebit() straight to MJobSendFB(), and I get the "invalid system
queue time" alert. I never see the XML lines, or any data showing up in
goldd.log.
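
A minimal sketch of how that difference could be checked across a whole
moab.log, assuming the line formats match the excerpts in this thread and
that LOGLEVEL is high enough for the Gold response XML to be logged. The
function name find_uncharged_jobs and the two patterns are illustrative
assumptions, not part of Moab or Gold. It lists job ids that have a
MAMAllocJDebit call but no later "Successfully charged job" message.

```python
import re
from typing import Iterable, List

# Patterns taken from the log excerpts in this thread (assumed to be stable).
DEBIT_RE = re.compile(r"MAMAllocJDebit\(A,(\d+),")
CHARGED_RE = re.compile(r"Successfully charged job (\d+)")


def find_uncharged_jobs(lines: Iterable[str]) -> List[str]:
    """Return job ids with a debit attempt but no logged success message."""
    attempted: List[str] = []  # keeps first-seen order
    charged = set()
    for line in lines:
        debit = DEBIT_RE.search(line)
        if debit and debit.group(1) not in attempted:
            attempted.append(debit.group(1))
        for ok in CHARGED_RE.finditer(line):
            charged.add(ok.group(1))
    return [job for job in attempted if job not in charged]


# Lines copied from the two cases shown in this thread.
sample = [
    "03/18 10:23:35 MAMAllocJDebit(A,3287524,SC,EMsg)",
    "Value><Code>000</Code><Message>Successfully charged job 3287524 for 10",
    "03/17 15:01:23 MAMAllocJDebit(A,3282258,SC,EMsg)",
    "03/17 15:01:23 MJobSendFB(3282258)",
]
print(find_uncharged_jobs(sample))  # ['3282258']
```

On a real log this would be called as find_uncharged_jobs(open(path)),
where path is wherever moab.log lives on the server. The resulting ids
can then be compared against the JobId column of glsres -I.
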
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Mar 18, 2010, at 1:21 PM, Scott Jackson wrote:
> Brock,
>
> It appears to me that Moab is trying to charge -- at least that is
> what I think MAMAllocJDebit should be doing. You would need to bump
> moab's LOGLEVEL up to about 7 to see the details of what it might be
> sending to Gold here. Please increase the loglevel and give me an
> excerpt.
>
> Also, I would be interested to see what the related glsjob shows for
> this job (Reserve or Charge). This should indicate the last job
> action taken against this job was:
>
> glsjob -J 3282258
>
> Also, please send your moab.cfg. I want to see what your
> CHARGEPOLICY is. If you have something like DEBITSUCCESSFULWC, then
> it will only debit for jobs it deems to be successful. If Moab is
> marking this as a failed job, it may not be charging. I would
> generally recommend using the ALL variants of CHARGEPOLICY (like
> DEBITALLWC). Excuse me if I misspelled some of these policy names, I
> did not look them up:)
>
> Thanks,
>
> Scott
>
>
> Brock Palen wrote:
>> It looks like Moab is getting into an odd state and not issuing the
>> charge:
>>
>> glsres -I
>> 4380349 3282258 345600 2010-03-17 11:39:13 2010-03-18 11:49:13
>> 4427406 aducoin ylyoung nyx 2
>>
>> In the moab logs I only see this Alert over and over:
>>
>> 03/17 11:44:59 ALERT: job '3282258' has been in state 'Running'
>> for 306 seconds. node 'nyx0900' is in state 'Running' (job
>> '3282258' will be cancelled
>> )
>> 03/17 11:44:59 MSysRegEvent(JOBCORRUPTION: job '3282258' (user
>> aducoin) has been in state 'Running' for 306 seconds. node
>> 'nyx0900' is in state 'Running'
>> (job '3282258' will be cancelled)
>>
>> 03/17 15:01:23 MJobProcessCompleted(3282258)
>> 03/17 15:01:23 MJobProcessTVariables(3282258)
>> 03/17 15:01:23 MAMAllocJDebit(A,3282258,SC,EMsg)
>> 03/17 15:01:23 MJobSendFB(3282258)
>> 03/17 15:01:23 MSysLaunchAction(ASList,)
>> 03/17 15:01:23 INFO: job usage sent for job '3282258'
>> 03/17 15:01:23 ALERT: job ' 3282258' has invalid
>> system queue time (SQ: 1268852118 > ST: 1268840355)
>> 03/17 15:01:23 INFO: job ' 3282258' completed.
>> QueueTime: 0 RunTime: 11960 Accuracy: 13.84 XFactor: 0.14
>> 03/17 15:01:23 INFO: overall statistics. Accuracy: nan
>> XFactor: inf
>> 03/17 15:01:23 INFO: job '3282258' completed X: 0.138426 T:
>> 11960 PS: 47840 A: 0.138426 (RM: nyx/nyx)
>> 03/17 15:01:23 MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>> 03/17 15:01:23 INFO: added completed job '3282258', Job
>> Completion Time Wed Mar 17 14:58:35
>>
>> 03/17 15:01:23 INFO: node 'nyx0900' released from job 3282258
>> 03/17 15:01:23 MJobRemove(3282258)
>> 03/17 15:01:23 MJobDestroyVM(3282258,EMsg)
>> 03/17 15:01:23 MRsvDestroy(3282258,TRUE,TRUE)
>> 03/17 15:01:23 MRsvDestroyCredLock(3282258)
>> 03/17 15:01:23 MJobDestroy(3282258)
>>
>> 03/17 15:06:07 MReqCreate(3282258,SrcRQ,DstRQ,TRUE)
>> 03/17 15:06:07 INFO: added completed job '3282258', Job
>> Completion Time Wed Mar 17 14:58:35
>> 03/17 15:06:07 MJobDestroy(3282258)
>>
>>
>> We run thousands of jobs a day, so most jobs are not showing this
>> behavior and do get charged.
>>
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>>
>> On Mar 17, 2010, at 2:11 PM, Scott Jackson wrote:
>>
>>> Brock,
>>>
>>> I might expect a few here and there, but on this scale I would say
>>> there is something pretty wrong.
>>>
>>> I would recommend using glsres -I to get a list of ones that have
>>> expired but were not removed. Then look for these in the goldd.log
>>> to see if Charges were issued for them. You may find that Errors
>>> occurred, or you may find that Moab never sent the charge request,
>>> or you may find that there is a bug in Gold where it is charging
>>> but the reservation is not getting removed (naturally, this is
>>> doubtful:).
>>>
>>> Scott
>>>
>>>
>>> Brock Palen wrote:
>>>> We tend to accumulate stale reservations (things that get deleted
>>>> with grmres -I)
>>>>
>>>> We have set up a cron job to run grmres -I every night, and it
>>>> deletes between 100 and 500 every day. Should this be
>>>> happening? What would be causing this?
>>>>
>>>> Thanks
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> gold-users mailing list
>>>> gold-users at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/gold-users