[torqueusers] premature end of message from server kills job

Joshua Bernstein jbernstein at penguincomputing.com
Fri Jan 22 16:42:07 MST 2010



Brock Palen wrote:
> On the sister mom on nyx0409 the error though is a 'premature end of
> message from server'

I've seen this message when the network is under high load, usually due to an 
MPI job running over the Ethernet fabric. Since you have a sister MOM, I'm 
guessing you also have an MPI job here?

I've also seen this with nodes using the "forcedeth" network driver, when the 
driver craps out under high load. I can't tell you how frustrating that stupid 
driver has been for me.

I also once had mismatched versions of pbs_mom running on different nodes. My 
guess is that the message protocol had changed between the versions, so the MOMs 
refused to talk to each other.
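
If you want to rule the version mismatch out quickly, something like the rough 
Python sketch below would do. It assumes you have already collected a version 
string from each node by whatever means you prefer; the host names other than 
nyx0409 and all of the version numbers are made up for illustration.

```python
from collections import defaultdict


def group_by_version(versions):
    """Group host names by the pbs_mom version string they reported.

    `versions` maps host name -> version string (collected however you like).
    """
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version.strip()].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def has_mismatch(versions):
    """True if the nodes reported more than one distinct version."""
    return len(group_by_version(versions)) > 1


# Hypothetical data: these version strings are invented, not taken from the cluster.
EXAMPLE_VERSIONS = {
    "nyx0409": "2.3.6",
    "nyx0410": "2.3.6",
    "nyx0411": "2.4.2",
}
```

Calling has_mismatch(EXAMPLE_VERSIONS) flags the odd node out, and 
group_by_version(EXAMPLE_VERSIONS) tells you which hosts to upgrade.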

Have you looked for any frame errors with ifconfig to see if maybe the driver is 
freaking out under load?
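
If you would rather watch the counters than eyeball ifconfig, here is a minimal 
Python sketch, assuming Linux nodes. It reads the same receive counters from the 
text of /proc/net/dev, where "errs" and "frame" are the third and sixth receive 
columns. The function names and the sample data are mine, purely for 
illustration.

```python
def parse_rx_errors(proc_net_dev_text):
    """Return {iface: {"errs": int, "frame": int}} from /proc/net/dev contents."""
    counters = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # not a full receive/transmit counter row
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        counters[name.strip()] = {"errs": int(fields[2]), "frame": int(fields[5])}
    return counters


def rising_frame_errors(before, after):
    """Return {iface: increase} for interfaces whose frame count went up."""
    rising = {}
    for iface, now in after.items():
        if iface in before and now["frame"] > before[iface]["frame"]:
            rising[iface] = now["frame"] - before[iface]["frame"]
    return rising


# Invented sample in /proc/net/dev layout, not real output from any node.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0:987654321 123456 42 0 0 17 0 3 55555555 99999 0 0 0 0 0 0
"""
```

Take one snapshot with parse_rx_errors(open("/proc/net/dev").read()) before the 
MPI job starts and another while it is running, then pass both to 
rising_frame_errors. A frame count that climbs under load would point at the 
driver or the link rather than at TORQUE.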

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

