[torqueusers] MOM requested job die
Danny Sternkopf
dsternkopf at hpce.nec.com
Tue Jun 5 07:02:59 MDT 2007
Hi,
I am trying to find out why a 49-node job was aborted by Torque after
it had been running fine for a couple of hours.
From the logs I can see the following:
Superior MOM (noco207):
06/05/2007 09:36:01;0001; pbs_mom;Svr;pbs_mom;node_bailout,
66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001; pbs_mom;Svr;pbs_mom;node_bailout,
66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001; pbs_mom;Svr;pbs_mom;node_bailout,
66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001; pbs_mom;Svr;pbs_mom;node_bailout,
66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:01;0001; pbs_mom;Svr;pbs_mom;node_bailout,
66289.cacau1.nec POLL failed from node noco031.nec 17)
06/05/2007 09:36:16;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to cacau1
06/05/2007 09:36:37;0008; pbs_mom;Job;66289.cacau1.nec;node 17
(noco031.nec) requested job die, 'EOF' (code 1099) - internal or
network failure attempting to communicate with sister MOM's
noco031:
06/05/2007 09:36:18;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to cacau1
06/05/2007 09:37:03;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to cacau1
06/05/2007 09:37:45;0002; pbs_mom;Svr;im_eof;Premature end of message
from addr 172.16.9.207:1023
06/05/2007 09:37:45;0001; pbs_mom;Svr;pbs_mom;im_eof, job
66289.cacau1.nec lost connection to MS on noco207.nec
06/05/2007 09:37:48;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to cacau1
06/05/2007 09:38:33;0002; pbs_mom;n/a;is_update_stat;status update
successfully sent to cacau1
06/05/2007 09:39:12;0002; pbs_mom;Svr;im_request;connect from
172.16.9.207:1023
06/05/2007 09:39:12;0008; pbs_mom;Job;66289.cacau1.nec;received
request 'KILL_JOB' from 172.16.9.207:1023
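One way to compare the two excerpts is to merge them into a single timeline. The following is only a minimal Python sketch and not part of Torque. It assumes every record starts with a "MM/DD/YYYY HH:MM:SS;code;" prefix as in the lines above and treats any other line as a continuation wrapped by the mail client. The function names parse_mom_log and merge_timelines are made up for illustration.

```python
import re
from datetime import datetime

# Start of a record as seen in the excerpts: "MM/DD/YYYY HH:MM:SS;code;"
# (assumption derived from the pasted lines, not from Torque documentation)
RECORD_START = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});(\d{4});")


def parse_mom_log(text, host):
    """Return (timestamp, host, code, message) tuples for one node's log.

    Lines that do not start with a timestamp are appended to the
    previous record, which re-joins records wrapped by the mail client.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = RECORD_START.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            message = line[match.end():].strip()
            records.append([stamp, host, match.group(2), message])
        elif records:
            records[-1][3] += " " + line
    return [tuple(record) for record in records]


def merge_timelines(*logs):
    """Merge parsed logs from several nodes, ordered by timestamp."""
    return sorted((rec for log in logs for rec in log), key=lambda r: r[0])


if __name__ == "__main__":
    # Two records copied from the excerpts above
    superior_log = """06/05/2007 09:36:37;0008; pbs_mom;Job;66289.cacau1.nec;node 17
(noco031.nec) requested job die, 'EOF' (code 1099) - internal or
network failure attempting to communicate with sister MOM's"""
    sister_log = """06/05/2007 09:37:45;0002; pbs_mom;Svr;im_eof;Premature end of message
from addr 172.16.9.207:1023"""

    timeline = merge_timelines(
        parse_mom_log(superior_log, "noco207"),
        parse_mom_log(sister_log, "noco031"),
    )
    for stamp, host, code, message in timeline:
        print(stamp.time(), host, code, message)
```

The timestamps come from each node's own clock, so the merged order only shows which side noticed the failure first if the clocks on both nodes are synchronized.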
It looks like noco031 lost its connection to the Superior MOM for some
reason. Did noco031 really lose its connection, or did the Superior
MOM lose its connection to noco031?
As far as I can see there were no network problems, and the nodes
showed no swapping or high CPU load.
What are possible reasons for the above error messages?
How can I find out more details?
Thank you for your help, and best regards,
Danny