[torqueusers] premature end of message from server kills job

Brock Palen brockp at umich.edu
Fri Jan 22 15:24:23 MST 2010


We had a parallel job die. The error on the mother superior was:

01/22/2010 14:49:55;0001;   pbs_mom;Job;3046564.nyx.engin.umich.edu;send_sisters:  sister #13 (nyx0409) is not ok (1099)
01/22/2010 14:49:55;0008;   pbs_mom;Job;3046564.nyx.engin.umich.edu;kill_task: killing pid 30970 task 1339 gracefully with sig 15
01/22/2010 14:49:55;0080;   pbs_mom;Job;3046564.nyx.engin.umich.edu;scan_for_terminated: job 3046564.nyx.engin.umich.edu task 1339 terminated, sid=30949


On the sister mom on nyx0409, however, the error is a 'premature end of
message from server':

01/22/2010 14:46:02;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.3, loglevel = 0
01/22/2010 14:48:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 141.212.31.100:1023
01/22/2010 14:49:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server nyx
01/22/2010 14:49:27;0002;   pbs_mom;Svr;im_eof;End of File from addr 141.212.31.100:15001
01/22/2010 14:49:27;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server nyx
01/22/2010 14:51:02;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.3, loglevel = 0


The IP address 141.212.31.100 belongs to our pbs_server.

The network between the hosts looks fine. I see this error in the mom
logs every so often, but I don't think I have ever noticed it killing a
job before. Any idea what could cause this? We are running
pbs_version = 2.4.3.
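
To put a number on "every so often", something like the following could
tally im_eof events per peer address across mom log lines. This is only
a rough sketch: it assumes the six-field, semicolon-separated layout seen
in the excerpts above, that each record sits on a single unwrapped line,
and that the message ends in "from addr HOST:PORT". The function names
parse_mom_line and count_im_eof_by_peer are made up for illustration and
are not part of Torque.

```python
from collections import Counter
from datetime import datetime


def parse_mom_line(line):
    """Split one pbs_mom log line into its six fields (layout assumed
    from the excerpts above). Returns None if the line does not fit."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    stamp, code, daemon, obj_class, name, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "when": when,
        "code": code.strip(),
        "daemon": daemon.strip(),
        "class": obj_class.strip(),
        "name": name.strip(),
        "message": message.strip(),
    }


def count_im_eof_by_peer(lines):
    """Count im_eof records per remote IP, ignoring the port.
    Assumes the message ends with 'from addr HOST:PORT'."""
    counts = Counter()
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None or rec["name"] != "im_eof":
            continue
        if "from addr " not in rec["message"]:
            continue
        peer = rec["message"].rsplit("from addr ", 1)[-1]
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts


# Three of the sister mom log lines quoted above, used as sample input.
SAMPLE = [
    "01/22/2010 14:48:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 141.212.31.100:1023",
    "01/22/2010 14:49:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server nyx",
    "01/22/2010 14:49:27;0002;   pbs_mom;Svr;im_eof;End of File from addr 141.212.31.100:15001",
]

if __name__ == "__main__":
    for peer, n in count_im_eof_by_peer(SAMPLE).items():
        print(peer, n)
```

Feeding it a whole day of mom log lines from each node would show whether
the EOFs cluster around the pbs_server address or around particular nodes.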

Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985




