[torqueusers] premature end of message from server kills job
jbernstein at penguincomputing.com
Fri Jan 22 17:51:30 MST 2010
Brock Palen wrote:
> On Jan 22, 2010, at 6:42 PM, Joshua Bernstein wrote:
>> Brock Palen wrote:
>>> On the sister mom on nyx0409 the error though is a 'premature end
>>> of message from server'
>> I've seen this message when the network is under high load, usually
>> due to an MPI job running over the Ethernet fabric. Since you have a
>> sister MOM, I'm guessing you also have an MPI job here?
> Yes MPI job, MPI traffic is over IB
>> I've also seen this with nodes using the "forcedeth" network driver,
>> when the driver craps out under high load. I can't tell you how
>> frustrating that stupid driver has been for me.
> No, all nodes are running Broadcom and using the tg3 driver.
> Also MPI traffic is over IB on this host, though storage traffic is not.
In my experience, tg3 is almost as bad as forcedeth. I would check modinfo on
the driver and see if you can get yourself a newer version.
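
Something along these lines should show which driver version you have, as a
rough sketch. The helper name driver_version is made up for illustration, eth0
is assumed from the ethtool output quoted in this thread, and the module may
not export a version field on every kernel:

```shell
# Hypothetical helper: print the "version:" field from modinfo-style
# (or "ethtool -i"-style) output read on stdin.
driver_version() {
    awk -F': *' '$1 == "version" { print $2; exit }'
}

# Usage on a node (assumes the tg3 module and interface eth0):
#   modinfo tg3 | driver_version        # version of the module on disk
#   ethtool -i eth0 | driver_version    # version of the driver bound to eth0
```

If the two versions differ, the node is probably still running an older module
than the one installed on disk.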
>> Also, once I had a situation where I had mismatched versions of
>> pbs_mom running and the message protocol I guess had changed and
>> caused them not to want to speak to each other anymore.
> No, all moms and the server are the same version.
>> Have you looked for any frame errors with ifconfig to see if maybe the
>> driver is freaking out under load?
> I see 109 dropped packets, ethtool -S eth0 shows rx_discards: 109
I don't think you should see any dropped packets under normal conditions. I
would look into updating the driver.
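
To keep an eye on the counters while the job runs, a small filter like the one
below may help. It is only a sketch: nonzero_error_counters is a hypothetical
helper, eth0 is assumed, and the counter names the driver reports (rx_discards
in your case) vary between drivers:

```shell
# Hypothetical helper: from "ethtool -S"-style output on stdin, print the
# counters whose name mentions discard, drop, or err and whose value is
# greater than zero.
nonzero_error_counters() {
    awk -F': *' '
        $1 ~ /discard|drop|err/ && $2 + 0 > 0 {
            gsub(/^[ \t]+/, "", $1)   # strip leading indentation
            print $1 "=" $2
        }'
}

# Usage on a node (assumes interface eth0):
#   ethtool -S eth0 | nonzero_error_counters
```

Running it twice a few minutes apart under load shows whether the counters are
still climbing or whether the 109 discards are left over from an earlier event.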