[torqueusers] premature end of message from server kills job
Brock Palen
brockp at umich.edu
Fri Jan 22 15:24:23 MST 2010
We had a parallel job die. The error on the mother superior was:
01/22/2010 14:49:55;0001; pbs_mom;Job;3046564.nyx.engin.umich.edu;send_sisters: sister #13 (nyx0409) is not ok (1099)
01/22/2010 14:49:55;0008; pbs_mom;Job;3046564.nyx.engin.umich.edu;kill_task: killing pid 30970 task 1339 gracefully with sig 15
01/22/2010 14:49:55;0080; pbs_mom;Job;3046564.nyx.engin.umich.edu;scan_for_terminated: job 3046564.nyx.engin.umich.edu task 1339 terminated, sid=30949
On the sister mom on nyx0409, though, the error is a 'Premature end of
message' from the server:
01/22/2010 14:46:02;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.3, loglevel = 0
01/22/2010 14:48:58;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 141.212.31.100:1023
01/22/2010 14:49:25;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server nyx
01/22/2010 14:49:27;0002; pbs_mom;Svr;im_eof;End of File from addr 141.212.31.100:15001
01/22/2010 14:49:27;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server nyx
01/22/2010 14:51:02;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.3, loglevel = 0
The IP address 141.212.31.100 belongs to our pbs_server.
The network between the hosts looks fine. I see this error in the mom
logs every so often, but I don't think I have ever noticed it killing a
job before. Any idea what could cause this? We are running
pbs_version = 2.4.3.
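
To get a rough idea of how often these im_eof messages turn up, something like the following could tally them per peer address. This is only a minimal sketch: it assumes one record per line (as in the log file itself, not the wrapped copy in this mail) and the semicolon-separated layout seen in the excerpts above. The function name tally_im_eof is made up for illustration and is not part of TORQUE.

```python
import re
import sys
from collections import Counter

# Matches the two im_eof message forms seen in the sister's log, e.g.
# "Premature end of message from addr 141.212.31.100:1023"
EOF_RE = re.compile(
    r"(Premature end of message|End of File) from addr "
    r"(\d+\.\d+\.\d+\.\d+):(\d+)"
)


def tally_im_eof(lines):
    """Count im_eof events per (message, peer IP) in pbs_mom log lines."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp;event code;daemon;class;object;message
        parts = line.rstrip("\n").split(";", 5)
        if len(parts) < 6 or parts[4] != "im_eof":
            continue
        match = EOF_RE.search(parts[5])
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Reads log lines from stdin and prints the most frequent peers first.
    for (msg, ip), n in tally_im_eof(sys.stdin).most_common():
        print(f"{n:6d}  {ip}  {msg}")
```

Feeding a day's mom log to it on stdin would show whether the pbs_server address is the only peer producing these messages or whether other hosts appear too.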
Thanks!
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985