[torqueusers] PBS_MOM kills running jobs when restarted
George Wm Turner
turnerg at indiana.edu
Mon Jul 20 08:48:17 MDT 2009
Use the -p option when you restart pbs_mom. From the pbs_mom man page:
<snip>
-p   Specifies the impact on jobs which were in execution when the
     mini-server shut down. On any restart of MOM, the new mini-server
     will not be the parent of any running jobs, MOM has lost control
     of her offspring (not a new situation for a mother). With the -p
     option, Mom will allow the jobs to continue to run and monitor
     them indirectly via polling. The -p option is mutually exclusive
     with the -r option.
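
As a rough sketch of what that looks like on the node, assuming pbs_mom is no longer running (for example after the segfault) and that you start it by hand as root rather than through an init script. The function name is made up for illustration, and the default path is the one reported in the startup log quoted below, so adjust it for your install:

```shell
# Default path is the one shown in the quoted startup log; override
# PBS_MOM if your install lives elsewhere.
PBS_MOM="${PBS_MOM:-/opt/soft/torque-2.3.6/sbin/pbs_mom}"

# Hypothetical helper: start MOM so that jobs which were already
# running are left alone and monitored by polling.
start_mom_keep_jobs() {
    # -p is mutually exclusive with -r, so pass only -p here.
    "$PBS_MOM" -p
}
```

Run start_mom_keep_jobs as root on the compute node. If your site starts MOM from an init script, the -p flag would have to be added to the pbs_mom command line there instead.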
george wm turner
high performance systems
812 855 5156
On Jul 20, 2009, at 8:51 AM, David McGiven wrote:
>
> Dear TORQUE users,
>
> I am using pbs_mom torque 2.3.6 in a computing node with 24
> processors.
>
> From time to time, it unexpectedly crashes dumping the following
> error to syslog :
>
> Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]:
> segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4
>
> Then, when I restart pbs_mom on that computing node, all the jobs that
> were still running are killed.
>
> I don't understand why. I've looked for an option to tell mom to not
> kill the jobs when restarting but I didn't find anything.
>
> Does anybody know what is causing the problem or how to solve it ?
>
> Thanks in advance.
>
> Best Regards,
> David McGiven
>
> P.S : Those are the messages pbs_mom dumps when started after a crash.
>
> 07/20/2009 13:26:25;0001; pbs_mom;Svr;pbs_mom;No such file or directory (2) in read_config, fstat: config
> 07/20/2009 13:26:25;0002; pbs_mom;Svr;setpbsserver;bender
> 07/20/2009 13:26:25;0002; pbs_mom;Svr;mom_server_add;server bender added
> 07/20/2009 13:26:25;0002; pbs_mom;n/a;initialize;independent
> 07/20/2009 13:26:25;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 07/20/2009 13:26:25;0002; pbs_mom;Svr;pbs_mom;Is up
> 07/20/2009 13:26:25;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/soft/torque-2.3.6/sbin/pbs_mom 1244734930
> 07/20/2009 13:26:25;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
> 07/20/2009 13:26:26;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3866 task 1 gracefully with sig 15
> 07/20/2009 13:26:31;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3866 task 1 with sig 9
> 07/20/2009 13:26:31;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3877 task 1 gracefully with sig 15
> 07/20/2009 13:26:32;0008; pbs_mom;Job;27571.bender ;kill_task: killing pid 3884 task 1 gracefully with sig 15
> 07/20/2009 13:26:34;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31771 task 1 gracefully with sig 15
> 07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31771 task 1 with sig 9
> 07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31782 task 1 gracefully with sig 15
> 07/20/2009 13:26:39;0008; pbs_mom;Job;27565.bender ;kill_task: killing pid 31789 task 1 gracefully with sig 15
> 07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 07/20/2009 13:26:41;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 07/20/2009 13:26:42;0008; pbs_mom;Job;27571.bender ;checking job post-processing routine
> 07/20/2009 13:26:42;0080; pbs_mom;Job;27571.bender ;obit sent to server
> 07/20/2009 13:26:42;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4918 task 1 gracefully with sig 15
> 07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4918 task 1 with sig 9
> 07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4929 task 1 gracefully with sig 15
> 07/20/2009 13:26:47;0008; pbs_mom;Job;27586.bender ;kill_task: killing pid 4936 task 1 gracefully with sig 15
> 07/20/2009 13:26:51;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4023 task 1 gracefully with sig 15
> 07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4023 task 1 with sig 9
> 07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4032 task 1 gracefully with sig 15
> 07/20/2009 13:26:56;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4039 task 1 gracefully with sig 15
> 07/20/2009 13:27:01;0008; pbs_mom;Job;27576.bender ;kill_task: killing pid 4039 task 1 with sig 9
> 07/20/2009 13:27:01;0002; pbs_mom;Svr;im_eof;End of File from addr 158.109.211.63:15001
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 07/20/2009 13:27:01;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 07/20/2009 13:27:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
> 07/20/2009 13:27:02;0008; pbs_mom;Job;27565.bender ;checking job post-processing routine
>
>
> --
> David McGiven
> Associate Researcher
> Universitat Autonoma de Barcelona
> davidmcgivenn at gmail.com