[torqueusers] PBS_MOM kills running jobs when restarted

David McGiven davidmcgivenn at gmail.com
Mon Jul 20 06:51:18 MDT 2009


Dear TORQUE users,

I am running pbs_mom from TORQUE 2.3.6 on a compute node with 24 processors.

From time to time it crashes unexpectedly, leaving the following
error in syslog:

Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4
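
As a reading aid for the kernel line above, here is a minimal Python sketch that splits it into its fields. The function name `decode_segfault` and the dictionary keys are made up for illustration and are not part of TORQUE or any kernel tool. It assumes the usual x86 page-fault error-code bits (bit 0: page was present, bit 1: write access, bit 2: fault came from user mode) and that the kernel prints these values in hexadecimal.

```python
import re

# Matches the older x86-64 kernel message format seen in the syslog line above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]:\s+segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "stack_pointer": int(m.group("rsp"), 16),
        # x86 page-fault error-code bits (assumed, see lead-in)
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = (
    "Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: "
    "segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4"
)
info = decode_segfault(line)
print(info)
```

Read this way, the line describes a user-mode read of an unmapped page at address 4, which is consistent with reading through a pointer that is NULL or nearly NULL. It does not by itself say which part of pbs_mom is at fault.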

Then, when I restart pbs_mom on that compute node, all the jobs that
were still running are killed.

I don't understand why. I have looked for an option that tells pbs_mom
not to kill the jobs on restart, but I could not find one.

Does anybody know what is causing the problem or how to solve it?

Thanks in advance.

Best Regards,
David McGiven

P.S.: These are the messages pbs_mom logs when it is started after a crash.

07/20/2009 13:26:25;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in read_config, fstat: config
07/20/2009 13:26:25;0002;   pbs_mom;Svr;setpbsserver;bender
07/20/2009 13:26:25;0002;   pbs_mom;Svr;mom_server_add;server bender       added
07/20/2009 13:26:25;0002;   pbs_mom;n/a;initialize;independent
07/20/2009 13:26:25;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
07/20/2009 13:26:25;0002;   pbs_mom;Svr;pbs_mom;Is up
07/20/2009 13:26:25;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/soft/torque-2.3.6/sbin/pbs_mom 1244734930
07/20/2009 13:26:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
07/20/2009 13:26:26;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 gracefully with sig 15
07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 with sig 9
07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3877 task 1 gracefully with sig 15
07/20/2009 13:26:32;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3884 task 1 gracefully with sig 15
07/20/2009 13:26:34;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31771 task 1 gracefully with sig 15
07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31771 task 1 with sig 9
07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31782 task 1 gracefully with sig 15
07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31789 task 1 gracefully with sig 15
07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:26:42;0008;   pbs_mom;Job;27571.bender   ;checking job post-processing routine
07/20/2009 13:26:42;0080;   pbs_mom;Job;27571.bender   ;obit sent to server
07/20/2009 13:26:42;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4918 task 1 gracefully with sig 15
07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4918 task 1 with sig 9
07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4929 task 1 gracefully with sig 15
07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4936 task 1 gracefully with sig 15
07/20/2009 13:26:51;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4023 task 1 gracefully with sig 15
07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4023 task 1 with sig 9
07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4032 task 1 gracefully with sig 15
07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4039 task 1 gracefully with sig 15
07/20/2009 13:27:01;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4039 task 1 with sig 9
07/20/2009 13:27:01;0002;   pbs_mom;Svr;im_eof;End of File from addr 158.109.211.63:15001
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/20/2009 13:27:01;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
07/20/2009 13:27:02;0008;   pbs_mom;Job;27565.bender   ;checking job post-processing routine
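
To make it easier to see which jobs were hit after the restart, the kill_task entries in the log above can be grouped by job with a short Python sketch like the one below. It assumes every log entry sits on a single line (mail-wrapped lines rejoined first) and that the fields are separated by semicolons as they appear here. The function name `kills_by_job` is hypothetical and is not a TORQUE utility.

```python
import re
from collections import defaultdict

# Matches the message part of a kill_task entry as it appears in the log above.
KILL_RE = re.compile(
    r"kill_task:\s+killing pid (\d+) task (\d+) (?:gracefully )?with sig (\d+)"
)


def kills_by_job(lines):
    """Map each job id to a list of (timestamp, pid, signal) kill attempts."""
    result = defaultdict(list)
    for line in lines:
        # Expected fields: timestamp;code;daemon;class;object;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[3] != "Job":
            continue
        m = KILL_RE.search(fields[5])
        if m is None:
            continue
        pid, _task, sig = (int(g) for g in m.groups())
        result[fields[4].strip()].append((fields[0], pid, sig))
    return dict(result)


# A few entries copied from the log above, one entry per line.
sample = [
    "07/20/2009 13:26:25;0002;   pbs_mom;Svr;pbs_mom;Is up",
    "07/20/2009 13:26:26;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 gracefully with sig 15",
    "07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 with sig 9",
    "07/20/2009 13:26:34;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31771 task 1 gracefully with sig 15",
    "07/20/2009 13:26:42;0080;   pbs_mom;Job;27571.bender   ;obit sent to server",
]
kills = kills_by_job(sample)
for job_id, attempts in kills.items():
    print(job_id, attempts)
```

Run against the full log, this shows the pattern visible by eye: for each job, pbs_mom first sends signal 15 to a process and, about five seconds later, signal 9 to the same pid if needed.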


--
David McGiven
Associate Researcher
Universitat Autonoma de Barcelona
davidmcgivenn at gmail.com




