[torqueusers] SOLVED: Re: maui crash after Successfully charged job

Eva Hocks hocks at sdsc.edu
Mon Oct 21 13:52:33 MDT 2013



Just as a follow-up: in our case the problem was file ownership.

maui runs under account "maui" but after updating maui via rpm, the
updated directories were root-owned and therefore not writable by the
maui account.

Changing the ownership back to the maui account made maui stable again.
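
For anyone who wants to check for the same problem, here is a rough
sketch (not something from our actual setup) that lists everything under
a directory tree that is not owned by a given uid. The function name
paths_not_owned_by is made up for illustration, and which directory to
scan depends on where your rpm installs maui.

```python
import os


def paths_not_owned_by(root, uid):
    """Return directories and files under root (root included) whose owner is not uid.

    Hypothetical helper for illustration only; it is not part of maui or torque.
    """
    mismatched = []
    # os.walk visits root and every subdirectory; symlinked directories are not followed.
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in candidates:
            # lstat reports the owner of the entry itself, not of a symlink target.
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched
```

The uid of the maui account can be looked up with
pwd.getpwnam("maui").pw_uid. Anything the function reports can then be
handed back to the maui account by running chown as root.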

-Eva


On Tue, 8 Oct 2013, Eva Hocks wrote:

>
>
> maui 3.3.1, torque 4.2.5 and gold 2.2.0.5
>
>
> maui appears to die or hang about every hour when communicating with the
> allocation manager (gold).
>
>
> maui crashed (the daemon is no longer running) after joballoccharge:
>
> 10/08 13:41:30 MSUDisconnect(S)
> 10/08 13:41:30 MSysEMSubmit(EM,allocation-manager,joballoccharge,842059[1])
> 10/08 13:41:30 MJobWriteStats(842059[1])
> 10/08 13:41:30 MJobToTString(842059[1],230,Buf,65536)
>
> 10/08 14:08:40 MSUDisconnect(S)
> 10/08 14:08:40 MSysEMSubmit(EM,allocation-manager,joballoccharge,842160[46])
> 10/08 14:08:40 MJobWriteStats(842160[46])
> 10/08 14:08:40 MJobToTString(842160[46],230,Buf,65536)
>
> 10/08 15:59:05 MSUDisconnect(S)
> 10/08 15:59:05 MSysEMSubmit(EM,allocation-manager,joballoccharge,841595)
> 10/08 15:59:05 MJobWriteStats(841595)
> 10/08 15:59:05 MJobToTString(841595,230,Buf,65536)
>
>
>
> hung situation:
>
> 10/08 15:26:01 MAMAllocJReserve(841268,RIndex,ErrMsg)
> 10/08 15:26:01 MS3DoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
> 10/08 15:26:01 MSysEMSubmit(EM,scheduler,comcom,scheduler,allocation-manager;)
> 10/08 15:26:01 MSUConnect(S,TRUE,EMsg)
> 10/08 15:26:01 MSUSendData(S,15000000,FALSE,FALSE)
> 10/08 15:26:01 MSecGetChecksum(Buf,378,Checksum,HMAC64,CSKey)
> 10/08 15:26:01 MSUSendPacket(8,Buf,710,15000000,SC)
> 10/08 15:26:01 INFO:     packet sent (710 bytes of 710)
> 10/08 15:26:01 INFO:     command sent to server
> 10/08 15:26:01 INFO:     message sent: '<XML>'
> 10/08 15:26:01 MSURecvData(S,15000000,FALSE,SC,EMsg)
> 10/08 15:26:01 MSURecvPacket(8,BufP,1024,
>
>
>
> 10/08 14:31:01 MAMAllocJReserve(840568,RIndex,ErrMsg)
> 10/08 14:31:01 MS3DoCommand(allocation-manager,NULL,OBuf,ODE,SC,EMsg)
> 10/08 14:31:01 MSysEMSubmit(EM,scheduler,comcom,scheduler,allocation-manager;)
> 10/08 14:31:01 MSUConnect(S,TRUE,EMsg)
> 10/08 14:31:01 MSUSendData(S,15000000,FALSE,FALSE)
> 10/08 14:31:01 MSecGetChecksum(Buf,377,Checksum,HMAC64,CSKey)
> 10/08 14:31:01 MSUSendPacket(8,Buf,709,15000000,SC)
> 10/08 14:31:01 INFO:     packet sent (709 bytes of 709)
> 10/08 14:31:01 INFO:     command sent to server
> 10/08 14:31:01 INFO:     message sent: '<XML>'
> 10/08 14:31:01 MSURecvData(S,15000000,FALSE,SC,EMsg)
> 10/08 14:31:01 MSURecvPacket(8,BufP,1024,
>
>
>
> Does anyone have any insight or hints on how to prevent the crashes?
>
> Thanks
> Eva


