[torqueusers] PBS_MOM kills running jobs when restarted

David McGiven davidmcgivenn at gmail.com
Mon Jul 20 08:55:39 MDT 2009


George,

Too bad I couldn't see that on the man page when I checked it.

Thank you very much.

David


On Jul 20, 2009, at 4:48 PM, George Wm Turner wrote:

> Use the -p option when you restart pbs_mom. From the pbs_mom man page:
>
> <snip>
> -p     Specifies the impact on jobs which were in execution when the
>        mini-server shut down.  On any restart of MOM, the new
>        mini-server will not be the parent of any running jobs, MOM has
>        lost control of her offspring (not a new situation for a
>        mother).  With the -p option, Mom will allow the jobs to
>        continue to run and monitor them indirectly via polling.  The
>        -p option is mutually exclusive with the -r option.
>
>
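
A minimal sketch of scripting that restart, assuming only what the man
page text above states: -p lets running jobs continue, and -p cannot be
combined with -r. The helper name pbs_mom_argv and the bare binary name
"pbs_mom" are hypothetical; adjust the path for your install. The code
only builds the argument list and does not launch anything.

```python
def pbs_mom_argv(flags, binary="pbs_mom"):
    """Build the argument list for starting pbs_mom with the given flags.

    The only rule enforced is the one quoted from the man page:
    -p and -r are mutually exclusive.
    """
    flags = list(flags)
    if "-p" in flags and "-r" in flags:
        raise ValueError("-p and -r are mutually exclusive")
    return [binary, *flags]


# Restart that leaves running jobs alone and polls them indirectly.
argv = pbs_mom_argv(["-p"])
print(" ".join(argv))
```

The resulting list could be handed to subprocess.run(argv, check=True)
on the compute node; that step is left out because it needs a real
TORQUE install and the privileges pbs_mom normally runs with.
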
> george wm turner
> high performance systems
> 812 855 5156
>
>
>
> On Jul 20, 2009, at 8:51 AM, David McGiven wrote:
>
>>
>> Dear TORQUE users,
>>
>> I am using pbs_mom (TORQUE 2.3.6) on a compute node with 24 processors.
>>
>> From time to time it crashes unexpectedly, leaving the following
>> error in syslog:
>>
>> Jul 20 13:21:48 klaus kernel: [3352942.287574] pbs_mom[31758]: segfault at 4 rip 417ddf rsp 7fffb6c36e20 error 4
>>
>> Then, when I restart pbs_mom on that compute node, all the jobs that
>> were still running are killed.
>>
>> I don't understand why. I've looked for an option to tell mom to not
>> kill the jobs when restarting but I didn't find anything.
>>
>> Does anybody know what is causing the problem or how to solve it?
>>
>> Thanks in advance.
>>
>> Best Regards,
>> David McGiven
>>
>> P.S.: These are the messages pbs_mom logs when started after a crash.
>>
>> 07/20/2009 13:26:25;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in read_config, fstat: config
>> 07/20/2009 13:26:25;0002;   pbs_mom;Svr;setpbsserver;bender
>> 07/20/2009 13:26:25;0002;   pbs_mom;Svr;mom_server_add;server bender       added
>> 07/20/2009 13:26:25;0002;   pbs_mom;n/a;initialize;independent
>> 07/20/2009 13:26:25;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
>> 07/20/2009 13:26:25;0002;   pbs_mom;Svr;pbs_mom;Is up
>> 07/20/2009 13:26:25;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/soft/torque-2.3.6/sbin/pbs_mom 1244734930
>> 07/20/2009 13:26:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
>> 07/20/2009 13:26:26;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 gracefully with sig 15
>> 07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 with sig 9
>> 07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3877 task 1 gracefully with sig 15
>> 07/20/2009 13:26:32;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3884 task 1 gracefully with sig 15
>> 07/20/2009 13:26:34;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31771 task 1 gracefully with sig 15
>> 07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31771 task 1 with sig 9
>> 07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31782 task 1 gracefully with sig 15
>> 07/20/2009 13:26:39;0008;   pbs_mom;Job;27565.bender   ;kill_task: killing pid 31789 task 1 gracefully with sig 15
>> 07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
>> 07/20/2009 13:26:41;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
>> 07/20/2009 13:26:42;0008;   pbs_mom;Job;27571.bender   ;checking job post-processing routine
>> 07/20/2009 13:26:42;0080;   pbs_mom;Job;27571.bender   ;obit sent to server
>> 07/20/2009 13:26:42;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4918 task 1 gracefully with sig 15
>> 07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4918 task 1 with sig 9
>> 07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4929 task 1 gracefully with sig 15
>> 07/20/2009 13:26:47;0008;   pbs_mom;Job;27586.bender   ;kill_task: killing pid 4936 task 1 gracefully with sig 15
>> 07/20/2009 13:26:51;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4023 task 1 gracefully with sig 15
>> 07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4023 task 1 with sig 9
>> 07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4032 task 1 gracefully with sig 15
>> 07/20/2009 13:26:56;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4039 task 1 gracefully with sig 15
>> 07/20/2009 13:27:01;0008;   pbs_mom;Job;27576.bender   ;kill_task: killing pid 4039 task 1 with sig 9
>> 07/20/2009 13:27:01;0002;   pbs_mom;Svr;im_eof;End of File from addr 158.109.211.63:15001
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
>> 07/20/2009 13:27:01;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
>> 07/20/2009 13:27:01;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server bender
>> 07/20/2009 13:27:02;0008;   pbs_mom;Job;27565.bender   ;checking job post-processing routine
>>
>>
>> --
>> David McGiven
>> Associate Researcher
>> Universitat Autonoma de Barcelona
>> davidmcgivenn at gmail.com
>>
>>
>>
>
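
To see at a glance which jobs a restart took down, the kill_task
records in the quoted mom log can be summarised with a short script.
This is a rough sketch, not a TORQUE tool: the semicolon-separated
field layout (timestamp;code;daemon;class;name;message) is inferred
from the excerpt above, each record is assumed to sit on a single line
as in the original log file, and the names KILL_RE and
killed_pids_by_job are made up for illustration.

```python
import re

# Message text of a kill_task record, as it appears in the quoted log.
KILL_RE = re.compile(
    r"kill_task: killing pid (\d+) task (\d+) (?:gracefully )?with sig (\d+)"
)


def killed_pids_by_job(lines):
    """Return {job id: {pid: [signals sent, in log order]}}."""
    jobs = {}
    for line in lines:
        # Split into at most six fields so the message keeps any ';' it has.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6 or fields[3] != "Job":
            continue  # not a per-job record
        match = KILL_RE.search(fields[5])
        if match is None:
            continue  # per-job record, but not a kill_task message
        job_id = fields[4].strip()  # job names are padded with spaces
        pid = int(match.group(1))
        sig = int(match.group(3))
        jobs.setdefault(job_id, {}).setdefault(pid, []).append(sig)
    return jobs


# Records copied from the log in this thread.
sample = [
    "07/20/2009 13:26:25;0002;   pbs_mom;Svr;pbs_mom;Is up",
    "07/20/2009 13:26:26;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 gracefully with sig 15",
    "07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3866 task 1 with sig 9",
    "07/20/2009 13:26:31;0008;   pbs_mom;Job;27571.bender   ;kill_task: killing pid 3877 task 1 gracefully with sig 15",
]

for job, pids in killed_pids_by_job(sample).items():
    print(job, pids)
```

A pid that lists both 15 and 9 was sent SIGTERM first and SIGKILL a
few seconds later, which matches the pattern in the log.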


--
David McGiven
Associate Researcher
Universitat Autonoma de Barcelona
davidmcgivenn at gmail.com




