[torqueusers] premature end of message from server kills job
jbernstein at penguincomputing.com
Fri Jan 22 16:42:07 MST 2010
Brock Palen wrote:
> On the sister mom on nyx0409 the error though is a 'premature end of
> message from server'
I've seen this message when the network is under high load, usually due to an
MPI job running over the Ethernet fabric. Since you have a sister MOM, I'm
guessing you also have an MPI job here?
I've also seen this with nodes using the "forcedeth" network driver, when the
driver craps out under high load. I can't tell you how frustrating that stupid
driver has been for me.
Also, I once had mismatched versions of pbs_mom running. I guess the message
protocol had changed between them, so they refused to talk to each other
anymore.
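
To rule out the version mismatch described above, one option is to collect the version string each MOM reports and group the hosts by it. The following is only a rough sketch: how the strings are gathered is left open, and the names group_by_version and has_mismatch are hypothetical helpers, not part of TORQUE.

```python
from collections import defaultdict


def group_by_version(versions):
    """Group hostnames by the version string each one reported.

    `versions` maps hostname -> version string. How those strings are
    collected is an assumption left to the caller.
    """
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version.strip()].append(host)
    # Sort the host lists so the result is stable and easy to compare.
    return {version: sorted(hosts) for version, hosts in groups.items()}


def has_mismatch(versions):
    """Return True when the hosts report more than one distinct version."""
    return len(group_by_version(versions)) > 1
```

If has_mismatch returns True for the nodes involved in the job, the smaller group in group_by_version is the likely set of nodes to upgrade or downgrade.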
Have you looked for any frame errors with ifconfig to see if maybe the driver is
freaking out under load?
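
On Linux the same receive-side counters that ifconfig prints can also be read from /proc/net/dev, which is easier to script across many nodes. A minimal sketch, assuming the standard 16-column layout of that file; the function names are hypothetical:

```python
def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: counters}.

    Assumes the usual layout: two header lines, then one line per
    interface with 8 receive and 8 transmit counters after the colon.
    """
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "rx_frame": int(fields[5]),
            "tx_errs": int(fields[10]),
        }
    return stats


def interfaces_with_frame_errors(stats):
    """Return the sorted names of interfaces with a non-zero frame count."""
    return sorted(name for name, c in stats.items() if c["rx_frame"] > 0)


def read_frame_errors(path="/proc/net/dev"):
    """Read the counters from `path` and list interfaces with frame errors."""
    with open(path) as handle:
        return interfaces_with_frame_errors(parse_net_dev(handle.read()))
```

Calling read_frame_errors() before and after a heavy MPI run and comparing the results shows whether the frame counter climbs under load, which would point at the driver or the link rather than at TORQUE.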
Senior Software Engineer