[Mauiusers] slurm 1.3.8 + maui-3.2.6p21-

А gip_gop at mail.ru
Fri Jun 26 13:02:20 MDT 2009


....
  
  I would be very grateful for any help.

  Could someone who runs slurm+maui set "LOGLEVEL 7" in maui.cfg and check whether their logs look like mine? In particular, please check whether the size in bytes on the MSecGetChecksum line matches the size on the previous line. In my logs they sometimes do not match: 4435 and 4363, for example.

06/26 13:34:47 INFO:     4435 of 4435 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)
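
  To make this comparison easier on a long log, here is a minimal Python sketch that scans the log lines and reports every place where a "bytes read" size is immediately followed by an MSecGetChecksum call with a different size. The line formats are taken only from the two lines quoted above; the function name find_size_mismatches and the SAMPLE list are made up for illustration and are not part of Maui or SLURM.

```python
import re

# Patterns for the two log lines being compared (formats copied from the excerpt).
READ_RE = re.compile(r"INFO:\s+(\d+) of (\d+) bytes read from sd (\d+)")
CKSUM_RE = re.compile(r"MSecGetChecksum\(Buf,(\d+),")


def find_size_mismatches(lines):
    """Return (bytes_read, checksum_size) pairs where the two sizes differ.

    Only a checksum line that directly follows a 'bytes read' line is compared.
    """
    mismatches = []
    last_read = None
    for line in lines:
        read_match = READ_RE.search(line)
        if read_match:
            last_read = int(read_match.group(1))
            continue
        cksum_match = CKSUM_RE.search(line)
        if cksum_match and last_read is not None:
            checksum_size = int(cksum_match.group(1))
            if checksum_size != last_read:
                mismatches.append((last_read, checksum_size))
        # Any other line breaks the "immediately preceding" relationship.
        last_read = None
    return mismatches


# Hypothetical sample: the two lines quoted in this message.
SAMPLE = [
    "06/26 13:34:47 INFO:     4435 of 4435 bytes read from sd 7",
    "06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)",
]

for bytes_read, checksum_size in find_size_mismatches(SAMPLE):
    print(f"read {bytes_read} bytes, checksum over {checksum_size} "
          f"(difference {bytes_read - checksum_size})")
```

  To run it on a real log, pass an open file instead of SAMPLE, for example find_size_mismatches(open("maui.log")), where the file name is whatever your maui.cfg log settings produce. A size difference alone does not prove a fault; the sketch only shows where the two numbers disagree.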




  The exact logs:

06/26 13:34:47 ServerProcessRequests()
06/26 13:34:47 MLogRoll(NULL,0,1)
06/26 13:34:47 INFO:     not rolling logs (441447 < 10000000)
06/26 13:34:47 MResAdjust(NULL,0,0)
06/26 13:34:47 MJobSetAttr(,PAL,Value,1,2)
06/26 13:34:47 INFO:     job flags for job : 0, req napolicy=SHARED
06/26 13:34:47 MJobSetAttr(,GAttr,Value,1,5)
06/26 13:34:47 MStatInitializeActiveSysUsage()
06/26 13:34:47 MStatClearUsage([NONE],Active)
06/26 13:34:47 ServerUpdate()
06/26 13:34:47 MSysUpdateTime()
06/26 13:34:47 INFO:     starting iteration 60
06/26 13:34:47 MSchedProcessJobs()
06/26 13:34:47 MRMGetInfo()
06/26 13:34:47 MClusterClearUsage()
06/26 13:34:47 MRMClusterQuery()
06/26 13:34:47 MWikiClusterLoadInfo(n00,RCount,EMsg,SC)
06/26 13:34:47 MWikiDoCommand(n00,7321,9000000,CHECKSUM,CMD=GETNODES ARG=0:ALL,Data,DataSize,SC)
06/26 13:34:47 MSUConnect(S,FALSE,EMsg)
06/26 13:34:47 INFO:     trying to connect to 10.1.0.1 (Port: 7321)
06/26 13:34:47 INFO:     non-blocking mode established
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO:     successful connect to TCP server (sd: 7)
06/26 13:34:47 MSUSendData(S,9000000,TRUE,FALSE)
06/26 13:34:47 MSecGetChecksum2(Buf1,27,Buf2,22,Checksum,DES,CSKey)
06/26 13:34:47 INFO:     header created '00000069
CK=2c5f6971a5844eef TS=1246008887 AUTH=root DT='
06/26 13:34:47 INFO:     sending short packet '00000069
CK=2c5f6971a5844eef TS=1246008887 AUTH=root DT=CMD=GETNODES ARG=0:ALL'
06/26 13:34:47 MSUSendPacket(7,Buf,78,9000000,SC)
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO:     packet sent (78 bytes of 78)
06/26 13:34:47 INFO:     command sent to server
06/26 13:34:47 INFO:     message sent: 'CMD=GETNODES ARG=0:ALL'
06/26 13:34:47 MSURecvData(S,9000000,TRUE,SC,EMsg)
06/26 13:34:47 MSURecvPacket(7,BufP,9,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO:     9 of 9 bytes read from sd 7
06/26 13:34:47 MSURecvPacket(7,BufP,4435,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO:     4435 of 4435 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)
06/26 13:34:47 ALERT:    checksum does not match (351c7a893a2e1699:b4584308b241ec39)  request 'TS=1246008887 AUTH=slurm DT=SC
=0 ARG=64#n01:STATE=Running;ARCH=x86_64;OS=Linux;CMEMORY=10240;CDISK=0;CPROC=8;#n02:STATE='
06/26 13:34:47 ERROR:    cannot receive data from server n00:7321
06/26 13:34:47 MSUDisconnect(S)
06/26 13:34:47 ALERT:    cannot get node list from WIKI RM
06/26 13:34:47 ALERT:    cannot load cluster resources on RM (RM 'n00' failed in function 'clusterquery')
06/26 13:34:47 WARNING:  no resources detected
06/26 13:34:47 MRMWorkloadQuery()
06/26 13:34:47 MWikiWorkloadQuery(n00,JCount,SC)
06/26 13:34:47 MWikiDoCommand(n00,7321,9000000,CHECKSUM,CMD=GETJOBS ARG=0:ALL,Data,DataSize,SC)
06/26 13:34:47 MSUConnect(S,FALSE,EMsg)
06/26 13:34:47 INFO:     trying to connect to 10.1.0.1 (Port: 7321)
06/26 13:34:47 INFO:     non-blocking mode established
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO:     successful connect to TCP server (sd: 7)
06/26 13:34:47 MSUSendData(S,9000000,TRUE,FALSE)
06/26 13:34:47 MSecGetChecksum2(Buf1,27,Buf2,21,Checksum,DES,CSKey)
06/26 13:34:47 INFO:     header created '00000068
CK=4e880ad31a667b74 TS=1246008887 AUTH=root DT='
06/26 13:34:47 INFO:     sending short packet '00000068
CK=4e880ad31a667b74 TS=1246008887 AUTH=root DT=CMD=GETJOBS ARG=0:ALL'
06/26 13:34:47 MSUSendPacket(7,Buf,77,9000000,SC)
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO:     packet sent (77 bytes of 77)
06/26 13:34:47 INFO:     command sent to server
06/26 13:34:47 INFO:     message sent: 'CMD=GETJOBS ARG=0:ALL'
06/26 13:34:47 MSURecvData(S,9000000,TRUE,SC,EMsg)
06/26 13:34:47 MSURecvPacket(7,BufP,9,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO:     3704 of 3704 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,3632,Checksum,DES,CSKey)
06/26 13:34:47 ALERT:    checksum does not match (e3743199c5566b9a:9ab1d151dd49049c)  request 'TS=1246008887 AUTH=slurm DT=SC
=0 ARG=17#191814:STATE=Running;TASKLIST=:n01;UPDATETIME=1246007985;WCLIMIT=31536000;TASKS='
06/26 13:34:47 ERROR:    cannot receive data from server n00:7321
06/26 13:34:47 MSUDisconnect(S)
06/26 13:34:47 ALERT:    cannot get job list from WIKI RM
06/26 13:34:47 ALERT:    cannot load cluster workload on RM (RM 'n00' failed in function 'workloadquery')
06/26 13:34:47 WARNING:  no workload detected
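
  As a cross-check of the byte counts in the log above, the following Python sketch splits the outgoing GETNODES message, exactly as it is printed in the "sending short packet" line, into its size prefix and its CK/TS/AUTH/DT fields. The framing (an 8-digit decimal size, a newline, then the payload) is inferred from these log lines only, not from Maui or SLURM documentation, and the helper names split_frame and parse_fields are hypothetical.

```python
def split_frame(raw):
    """Split '<8-digit size>\\n<payload>' into (declared_size, payload)."""
    size_text, payload = raw.split("\n", 1)
    return int(size_text), payload


def parse_fields(payload):
    """Split the payload into CK, TS, AUTH and DT; DT keeps the rest of the message."""
    head, data = payload.split(" DT=", 1)
    fields = dict(item.split("=", 1) for item in head.split(" "))
    fields["DT"] = data
    return fields


# Message text copied from the 'sending short packet' log line for GETNODES.
raw = ("00000069\n"
       "CK=2c5f6971a5844eef TS=1246008887 AUTH=root DT=CMD=GETNODES ARG=0:ALL")

size, payload = split_frame(raw)
fields = parse_fields(payload)

# Text between the CK field and the DT value, as it appears in the message.
after_ck = "TS={TS} AUTH={AUTH} DT=".format(**fields)

print("declared size:", size, "payload length:", len(payload))
print("total length including size prefix:", len(raw))
print("length of text after CK up to DT=:", len(after_ck))
print("length of DT value:", len(fields["DT"]))
```

  For this outgoing message the numbers agree with the log: the declared size equals the payload length, the total equals the 78 bytes reported by MSUSendPacket, and the two partial lengths equal the 27 and 22 passed to MSecGetChecksum2. That suggests, on the sending side at least, that the checksum is computed over the text after the CK field. Whether the reply from slurm follows the same layout is exactly what is in question here, so the sketch makes no claim about it.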




