[torqueusers] MOM requested job die

Danny Sternkopf dsternkopf at hpce.nec.com
Tue Jun 5 07:02:59 MDT 2007


Hi,

I am trying to find out why a 49-node job was aborted by Torque after 
it had been running fine for a couple of hours.
From the logs I can see the following:

Superior MOM (noco207):
06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:16;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to cacau1
06/05/2007 09:36:37;0008;   pbs_mom;Job;66289.cacau1.nec;node 17 (noco031.nec) requested job die, 'EOF' (code 1099) - internal or network failure attempting to communicate with sister MOM's

noco031:
06/05/2007 09:36:18;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to cacau1
06/05/2007 09:37:03;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to cacau1
06/05/2007 09:37:45;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 172.16.9.207:1023
06/05/2007 09:37:45;0001;   pbs_mom;Svr;pbs_mom;im_eof, job 66289.cacau1.nec lost connection to MS on noco207.nec
06/05/2007 09:37:48;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to cacau1
06/05/2007 09:38:33;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to cacau1
06/05/2007 09:39:12;0002;   pbs_mom;Svr;im_request;connect from 172.16.9.207:1023
06/05/2007 09:39:12;0008;   pbs_mom;Job;66289.cacau1.nec;received request 'KILL_JOB' from 172.16.9.207:1023

It looks like noco031 lost its connection to the Superior MOM for some 
reason. Did noco031 really lose its connection, or did the Superior 
MOM lose its connection to noco031?
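
To compare the two views, the excerpts from both nodes can be merged into 
one timeline. The following is only a minimal Python sketch, assuming the 
semicolon-separated layout seen in the excerpts (timestamp, event code, 
daemon, class, object, message). The helper names parse_mom_line and 
merge_timelines are made up for illustration and are not part of Torque. 
The ordering is only meaningful if the clocks on both nodes are 
synchronized, which the logs themselves do not confirm.

```python
from datetime import datetime


def parse_mom_line(host, line):
    """Split one pbs_mom log entry into (timestamp, host, event code, message)."""
    # Layout assumed from the excerpts: timestamp;code;daemon;class;object;message
    stamp, code, _daemon, _cls, _obj, message = line.split(";", 5)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, host, code, message.strip()


def merge_timelines(logs):
    """Merge {host: [log lines]} into one list of events sorted by timestamp."""
    events = [
        parse_mom_line(host, line)
        for host, lines in logs.items()
        for line in lines
    ]
    return sorted(events, key=lambda event: event[0])


# A few entries copied from the excerpts, one entry per line.
logs = {
    "noco207": [
        "06/05/2007 09:36:01;0001;   pbs_mom;Svr;pbs_mom;node_bailout, 66289.cacau1.nec POLL failed from node noco031.nec 17)",
        "06/05/2007 09:36:37;0008;   pbs_mom;Job;66289.cacau1.nec;node 17 (noco031.nec) requested job die, 'EOF' (code 1099) - internal or network failure attempting to communicate with sister MOM's",
    ],
    "noco031": [
        "06/05/2007 09:37:45;0001;   pbs_mom;Svr;pbs_mom;im_eof, job 66289.cacau1.nec lost connection to MS on noco207.nec",
        "06/05/2007 09:39:12;0008;   pbs_mom;Job;66289.cacau1.nec;received request 'KILL_JOB' from 172.16.9.207:1023",
    ],
}

for when, host, code, message in merge_timelines(logs):
    print(f"{when:%H:%M:%S}  {host:8}  {code}  {message}")
```

With the entries above, the POLL failure and the "requested job die" entry 
on noco207 sort before the im_eof entry on noco031, again assuming the 
clocks agree.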

As far as I can see, there were no network problems, and the nodes 
showed no swapping or high CPU load.

What could cause the error messages above?
How can I find out more details?

Thank you for your help, and best regards,

Danny


