[torqueusers] premature end of message from server kills job
brockp at umich.edu
Fri Jan 22 17:18:42 MST 2010
On Jan 22, 2010, at 6:42 PM, Joshua Bernstein wrote:
> Brock Palen wrote:
>> On the sister mom on nyx0409 the error though is a 'premature end
>> of message from server'
> I've seen this message when the network is under high load, usually
> due to an MPI job running over the Ethernet fabric. Since you have a
> sister MOM, I'm guessing you also have an MPI job here?
Yes, it is an MPI job; MPI traffic is over IB.
> I've also seen this with nodes using the "forcedeth" network driver,
> when the driver craps out under high load. I can't tell you how
> frustrating that stupid driver has been for me.
No, all nodes have Broadcom NICs and use the tg3 driver.
Also, MPI traffic is over IB on this host. Though storage traffic is
> Also, once I had a situation where I had mismatched versions of
> pbs_mom running and the message protocol I guess had changed and
> caused them not to want to speak to each other anymore.
No, all MOMs and the server are the same version.
> Have you looked for any frame errors with ifconfig to see if maybe
> the driver is freaking out under load?
I see 109 dropped packets; ethtool -S eth0 shows rx_discards: 109
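For comparing counters across nodes, something like the following might help. It is only a minimal sketch, assuming Python is available on the node: it filters `ethtool -S` output down to the nonzero counters whose names suggest loss or errors. The function name, the keyword list, and the sample text are made up for illustration; only the rx_discards: 109 line comes from this host.

```python
import re

def nonzero_error_counters(ethtool_output):
    """Return {counter: value} for nonzero counters whose name hints at loss or errors."""
    # Matches lines of the form "   counter_name: 123"; header lines do not match.
    pattern = re.compile(r"^\s*([\w.\[\]-]+):\s*(\d+)\s*$")
    # Hypothetical keyword list; counter names vary by driver.
    keywords = ("discard", "drop", "err", "fcs", "crc")
    found = {}
    for line in ethtool_output.splitlines():
        match = pattern.match(line)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        if value and any(k in name.lower() for k in keywords):
            found[name] = value
    return found

# Illustrative sample; only "rx_discards: 109" is taken from the thread.
SAMPLE = """NIC statistics:
     rx_octets: 123456
     rx_discards: 109
     rx_errors: 0
"""

print(nonzero_error_counters(SAMPLE))
```

On a live node the input could come from subprocess.run(["ethtool", "-S", "eth0"], capture_output=True, text=True).stdout instead of the sample string. Running it twice a few minutes apart under load would show whether the counter is still climbing.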
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing