[torqueusers] premature end of message from server kills job

Brock Palen brockp at umich.edu
Fri Jan 22 17:18:42 MST 2010


On Jan 22, 2010, at 6:42 PM, Joshua Bernstein wrote:

>
>
> Brock Palen wrote:
>> On the sister mom on nyx0409, though, the error is a 'premature end
>> of message from server'
>
> I've seen this message when the network is under high load, usually  
> due to an MPI job running over the Ethernet fabric. Since you have a  
> sister MOM, I'm guessing you also have an MPI job here?

Yes, it is an MPI job, but the MPI traffic is over IB.

>
> I've also seen this with nodes using the "forcedeth" network driver,  
> when the driver craps out under high load. I can't tell you how  
> frustrating that stupid driver has been for me.

No, all nodes have Broadcom NICs and use the tg3 driver.
Also, MPI traffic is over IB on this host, though storage traffic is
not.

>
> Also, once I had a situation where I had mismatched versions of  
> pbs_mom running and the message protocol I guess had changed and  
> caused them not to want to speak to each other anymore.

No, all moms and the server are the same version.

>
> Have you looked for any frame errors with ifconfig to see if maybe  
> the driver is freaking out under load?

I see 109 dropped packets in ifconfig, and ethtool -S eth0 shows rx_discards: 109
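
For checking the same counters on more than one node, here is a minimal
sketch in Python that pulls the nonzero drop/discard/error counters out of
saved "ethtool -S" output. The function name, the regex, and the sample text
are my own assumptions for illustration; only the rx_discards: 109 line comes
from the node above, and the other sample lines are made up.

```python
import re

def nonzero_error_counters(ethtool_output, pattern=r"(discard|drop|err|crc)"):
    """Return {counter_name: value} for nonzero counters whose name matches pattern.

    ethtool_output is the text printed by `ethtool -S <iface>`.
    The default pattern is a guess at useful counter names, not a complete list.
    """
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        name, value = name.strip(), value.strip()
        if not value.isdigit():
            continue  # skips the "NIC statistics:" header and odd lines
        if int(value) and re.search(pattern, name, re.IGNORECASE):
            counters[name] = int(value)
    return counters

# Illustrative sample; only rx_discards: 109 is taken from this thread.
sample = """NIC statistics:
     rx_octets: 123456
     rx_discards: 109
     rx_errors: 0
"""

print(nonzero_error_counters(sample))
```

Running it on the output from each node (collected however you like) makes it
easy to see whether the discards are confined to one host or show up
everywhere.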

>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
>
>


