[torquedev] pbs_mom -p and rerunnable jobs

Wendy Lin hclin at lbl.gov
Mon Jan 11 09:38:47 MST 2010

>> From reading it sounds to me like the problem is not that pbs_mom  
>> tries to recover jobs. The problem is that it kills re-runnable  
>> jobs, which have indeed been restarted.
> If pbs_mom is restarted with jobs running and those jobs are still  
> running when pbs_mom is initialized the mom tracks those jobs even  
> though she no longer owns them. This is the behavior we want. Glen  
> points out there is a possibility pbs_mom could pick up the wrong  
> job because the pid of a currently running job is the same as the  
> pid of a previous job. For the case of a pbs_mom restart this is  
> not likely. In the case where a machine is restarted this scenario  
> is more likely. In this case it seems the -q or -r option should be  
> used to tell the mom to discard any TK files. Of course -q and -r  
> could be used on a restart of pbs_mom as well.
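
To make the PID-reuse concern above concrete, here is a rough sketch of one way a recovering daemon could tell "same process" from "same pid, different process". This is not what pbs_mom does. The names TrackedTask, read_start_ticks and classify_tracked_tasks are made up for illustration, and it assumes a Linux /proc filesystem and that the process start time was saved next to the pid when the job was launched.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Optional


class TrackedTask(NamedTuple):
    """Hypothetical record saved when a job's process was launched."""
    job_id: str
    pid: int
    start_ticks: int  # process start time (clock ticks since boot) at launch


def read_start_ticks(pid: int) -> Optional[int]:
    """Return the starttime field of /proc/<pid>/stat, or None if the
    process does not exist. Linux only."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The comm field may contain spaces or parentheses, so split after
    # the last ')'. The remaining fields start at field 3 (state), which
    # puts field 22 (starttime) at index 19.
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[19])


def classify_tracked_tasks(
    tasks: Iterable[TrackedTask],
    lookup: Callable[[int], Optional[int]] = read_start_ticks,
) -> Dict[str, str]:
    """Decide, per job, whether the recorded pid still refers to the
    process that was originally launched."""
    result: Dict[str, str] = {}
    for task in tasks:
        current = lookup(task.pid)
        if current is None:
            result[task.job_id] = "exited"      # no such pid any more
        elif current == task.start_ticks:
            result[task.job_id] = "adopt"       # same pid, same start time
        else:
            result[task.job_id] = "pid_reused"  # pid now belongs to another process
    return result
```

Start times are counted from boot, so after a machine restart a match on both pid and start time is unlikely but not impossible. A real check would probably also want to record something like the boot time.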

From the pbs_mom man page:

       -q              ......  With the -q option, the MOM will mark
                       the jobs as terminated and notify the batch
                       server which owns the job......

       -r              ......  With the -r option, MOM will kill any
                       processes belonging to jobs, mark the jobs as
                       terminated, and notify the batch server which
                       owns the job.

I don't see how either "-q" or "-r" can work, since the server purges
these jobs in response to MOM's notifications. The -q option is new in
Torque, so I tested it, and it did not help.

> At this point I am going to work on the problem where re-runnable  
> jobs are killed after they have been restarted. Please respond if  
> there is something wrong with this analysis.

MOM's default behavior was changed at some point in Torque. Would it
be easier to just restore the old code?

Wendy Lin
Computation Systems Group
hclin at lbl.gov
