[torquedev] pbs_mom -p and rerunnable jobs

Wendy Lin hclin at lbl.gov
Mon Jan 11 09:38:47 MST 2010


>> From reading it sounds to me like the problem is not that pbs_mom  
>> tries to recover jobs. The problem is that it kills re-runnable  
>> jobs, which have indeed been restarted.
>
> If pbs_mom is restarted with jobs running and those jobs are still  
> running when pbs_mom is initialized the mom tracks those jobs even  
> though she no longer owns them. This is the behavior we want. Glen  
> points out there is a possibility pbs_mom could pick up the wrong  
> job because the pid of a currently running job is the same as the  
> pid of a previous job. For the case of a pbs_mom restart this is  
> not likely. In the case where a machine is restarted this scenario  
> is more likely. In this case it seems the -q or -r option should be  
> used to tell the mom to discard any TK files. Of course -q and -r  
> could be used on a restart of pbs_mom as well.
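
The pid-reuse concern raised above can be made concrete with a small sketch. This is not Torque code; it assumes a Linux /proc filesystem and assumes that a start time was saved alongside the pid when the job was launched (the function names below are hypothetical). The idea is that a pid alone does not identify a job, but a pid plus the start time of the process does, at least within one boot:

```python
def parse_start_time(stat_line):
    """Return the starttime field (field 22) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the last ')' before tokenizing.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 22 sits at index 19.
    return int(after_comm[19])


def read_start_time(pid):
    """Return the start time of pid in clock ticks since boot, or None if gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_start_time(f.read())
    except OSError:
        # No such process (or no /proc at all): treat the pid as not running.
        return None


def is_same_process(pid, recorded_start):
    """True only if pid is alive and started at the recorded time."""
    current = read_start_time(pid)
    return current is not None and current == recorded_start
```

Because the start time is counted from boot, a check like this only covers the pbs_mom restart case. For the machine restart case the saved record would also need something that identifies the boot itself.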


From pbs_mom man page:

       -q              ......  With the -q option,
                        The MOM will mark the jobs as terminated and notify the
                        batch server which owns the job......

       -r              ......  With the -r option,
                        MOM will kill any processes belonging to jobs, mark the
                        jobs as terminated, and notify the batch server which
                        owns the job.


I don't see how either "-q" or "-r" can work, because the server purges  
these jobs in response to MOM's notifications. The -q option was new in  
Torque, so I tested it, and it did not help.


> At this point I am going to work on the problem where re-runnable  
> jobs are killed after they have been restarted. Please respond if  
> there is something wrong with this analysis.


MOM's default behavior was changed at some point in Torque. Would it  
be easier to just restore the old code?

-- 
Wendy Lin
Computation Systems Group
hclin at lbl.gov




