[torquedev] pbs_mom -p and rerunnable jobs

Ken Nielson knielson at adaptivecomputing.com
Mon Jan 11 14:23:52 MST 2010


Ken Nielson wrote:
> Wendy Lin wrote:
>   
>> From pbs_mom man page:
>>
>>       -q              ......  With the -q option,
>>                        The MOM will mark the jobs as terminated and notify
>>                        the batch server which owns the job......
>>
>>       -r              ......  With the -r option,
>>                        MOM will kill any processes belonging to jobs, mark
>>                        the jobs as terminated, and notify the batch server
>>                        which owns the job.
>>
>>
>> I don't see how either "-q" or "-r" can work as the server purges 
>> these jobs in response to MOM's notifications. The -q was new in 
>> Torque, so I tested it, and it did not help.
>>
>>     
> After studying the code and TORQUE's behavior further, I see that the 
> default behavior (-p) will either "adopt" any running jobs or report the 
> end of jobs whose pids do not match what is in the TK files. In 
> both cases pbs_mom will report the end of the job and create an obituary, 
> and pbs_server believes that the jobs ran to completion. At first it 
> looks like TORQUE is deleting re-runnable jobs, but it is merely 
> reporting the jobs as finished and removing them from the queue.
>
> If pbs_mom is restarted with the -q option and no jobs are still 
> running, pbs_server will re-queue the jobs so they can be run again 
> later. If jobs are still running when pbs_mom is started with -q, the 
> jobs are terminated and go to an E state, but are then re-queued so they 
> can be run again.
>   
>> MOM's default behavior got changed at some point in Torque. Would it 
>> be easier if you just recovered the old code?
>>
>>     
> At this point I would say that there is some misunderstanding about what 
> the default behavior of pbs_mom should be if pbs_mom is terminated with 
> jobs running. Right now I would say that if you want to be able to rerun 
> jobs you should start pbs_mom with a -r.
Correction: you should start pbs_mom with -q, not -r. Sorry for not 
proofreading my e-mail.
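
To make the cases easier to compare, here is a rough Python sketch of the
restart behavior as I have described it in this thread. It is only a model
for discussion, not TORQUE code: the names restart_outcome and Outcome are
made up for illustration, and the job_still_running argument is an
assumption that stands in for "the job's pids match what is in the TK
files" in the -p case.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    mom_action: str      # what pbs_mom does with the job on restart
    server_result: str   # what pbs_server ends up doing with the job


def restart_outcome(flag: str, job_still_running: bool) -> Outcome:
    """Hypothetical model of the restart behavior described in this thread.

    For -p, job_still_running stands for "the job's pids match the TK files".
    """
    if flag == "-p":
        action = "adopt the job" if job_still_running else "report the job as ended"
        # Either way the server ends up believing the job ran to completion.
        return Outcome(action, "completed and removed from the queue")
    if flag == "-q":
        action = (
            "terminate the job (E state)"
            if job_still_running
            else "no processes to terminate"
        )
        # With -q the job goes back into the queue so it can run again.
        return Outcome(action, "re-queued")
    raise ValueError(f"flag not covered by this sketch: {flag!r}")


if __name__ == "__main__":
    for flag in ("-p", "-q"):
        for running in (True, False):
            print(flag, running, restart_outcome(flag, running))
```

The -r option is left out on purpose, since I have only quoted its man page
text here and have not described what pbs_server does with the job afterwards.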

> If you want the ability to 
> recover jobs which may still be running, use the default behavior (-p).
>
> What was the default behavior previously?
>
> Thanks
>
> Ken


