[torqueusers] premature end of message from server kills job

Joshua Bernstein jbernstein at penguincomputing.com
Fri Jan 22 17:51:30 MST 2010



Brock Palen wrote:
> On Jan 22, 2010, at 6:42 PM, Joshua Bernstein wrote:
> 
>>
>>
>> Brock Palen wrote:
>>> On the sister MOM on nyx0409, though, the error is a 'premature end 
>>> of message from server'
>>
>> I've seen this message when the network is under high load, usually 
>> due to an MPI job running over the Ethernet fabric. Since you have a 
>> sister MOM, I'm guessing you also have an MPI job here?
> 
> Yes, it is an MPI job, but MPI traffic is over IB.
> 
>>
>> I've also seen this with nodes using the "forcedeth" network driver, 
>> when the driver craps out under high load. I can't tell you how 
>> frustrating that stupid driver has been for me.
> 
> No, all nodes are running Broadcom NICs and using the tg3 driver.
> Also, MPI traffic is over IB on this host, though storage traffic is not.

In my experience, tg3 is almost as bad as forcedeth. I would run modinfo on 
the driver to see which version you have, then see if you can get a newer one.
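
As a rough sketch of that check, the filter below pulls the version field out 
of modinfo output. The function name module_version is only an illustrative 
helper, and it assumes the module reports a "version:" line at all (some 
in-tree drivers do not, in which case it prints nothing).

```shell
# Hypothetical helper: print the value of the "version:" field from
# modinfo output on stdin. Matching the first field exactly avoids
# picking up the "srcversion:" and "vermagic:" lines.
module_version() {
    awk '$1 == "version:" { print $2 }'
}

# Usage (tg3 is the driver named in this thread):
#   modinfo tg3 | module_version
#   ethtool -i eth0    # also reports the driver and version bound to eth0
```

Comparing that value across nodes is also a quick way to confirm they all 
run the same driver build.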

>>
>> Also, once I had a situation where I had mismatched versions of 
>> pbs_mom running and the message protocol I guess had changed and 
>> caused them not to want to speak to each other anymore.
> 
> No, all MOMs and the server are the same version.
> 
>>
>> Have you looked for any frame errors with ifconfig to see if maybe the 
>> driver is freaking out under load?
> 
> I see 109 dropped packets; ethtool -S eth0 shows rx_discards: 109

I don't think you should see any dropped packets under normal conditions. I 
would look into updating the driver.
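
To keep an eye on this across nodes, something like the following could 
list only the non-zero drop, discard, and error counters from ethtool -S. 
This is a minimal sketch: the function name nonzero_error_counters is made 
up for illustration, and counter names vary by driver, so the pattern may 
need adjusting.

```shell
# Hypothetical helper: read "ethtool -S" output on stdin and print every
# counter whose name contains drop, discard, or err and whose value is
# greater than zero.
nonzero_error_counters() {
    awk -F': *' '/drop|discard|err/ && $2 + 0 > 0 {
        sub(/^[ \t]+/, "", $1)   # strip the leading indentation
        print $1 ": " $2
    }'
}

# Usage (eth0 as in this thread):
#   ethtool -S eth0 | nonzero_error_counters
```

Running it before and after a job shows whether the counters are still 
climbing under load or are left over from an earlier event.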

-Josh

