[torquedev] pbs_mom -p and rerunnable jobs

Ken Nielson knielson at adaptivecomputing.com
Tue Jan 12 12:24:41 MST 2010


Wendy Lin wrote:
>
> From Torque 2.3.7 man page:
>
>
> ----
> Normally the mini-server is started from the system boot file without
> the -p or the -r option. The mini-server will make no attempt to
> signal the former session of any job which may have been running
> when the mini-server terminated. It is assumed that on reboot, all
> processes have been killed. The MOM will mark the jobs as terminated
> and notify the batch server which owns the job.
> ----
>
> From OpenPBS man page:
>
> ----
> Normally the mini-server is started from the system boot file without 
> the -p or the -r option. The mini-server will make no attempt to 
> signal the former session of any job which may have been running when 
> the mini-server terminated. It is assumed that on reboot, all 
> processes have been killed.
> ----
>
> Looks like Torque added the MOM's sending obit part, which I believe 
> is the problem.
I do not think that sending an obit is a problem. If the mom dies, then
on restart it needs to notify the server of what it believes the status
of its jobs is, so the server can update its queue.

It seems we have some middle ground that is not getting covered.

It seems we need two new options: one where the mom will re-queue jobs
but not try to kill them, and one where the mom will report that jobs
are done but not try to adopt them.

Either of these options could be used at any time at the administrator's
discretion, but the idea is to use them after a system reboot or if the
mom has been down for a long time. In either case it is assumed the job
processes are no longer running. With the first option all jobs will be
re-queued so they can be run again. With the second option all jobs are
simply reported as done and deleted.
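
To make the difference between the modes concrete, here is a rough
sketch in Python. It is not Torque code. The names RecoveryMode and
plan_recovery, the message strings, and the job ids are all made up for
illustration, and the PRESERVE case assumes that -p means the mom keeps
tracking the jobs it finds.

```python
from enum import Enum


class RecoveryMode(Enum):
    """Hypothetical startup modes; these are not real pbs_mom flags."""
    PRESERVE = "preserve"        # assumed -p style: keep tracking the jobs
    REQUEUE_NO_KILL = "requeue"  # proposed: re-queue, send no signals
    REPORT_DONE = "report_done"  # proposed: report done, do not adopt


def plan_recovery(job_ids, mode):
    """Return a list of (job_id, message_to_server) pairs.

    None of the modes signals the old job processes. For the two
    proposed modes those processes are assumed to be gone already.
    """
    plan = []
    for job_id in job_ids:
        if mode is RecoveryMode.REQUEUE_NO_KILL:
            # The server puts the job back in the queue to run again.
            plan.append((job_id, "requeue"))
        elif mode is RecoveryMode.REPORT_DONE:
            # The server is told the job ended and removes it.
            plan.append((job_id, "obit"))
        else:
            # The mom keeps the job and reports it as still running.
            plan.append((job_id, "running"))
    return plan


if __name__ == "__main__":
    jobs = ["101.example", "102.example"]  # made-up job ids
    for mode in RecoveryMode:
        print(mode.name, plan_recovery(jobs, mode))
```

The point of the sketch is that both new modes only change what the mom
tells the server. Neither one touches the old processes.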

Changing the default behavior has already been tried and has brought us
to this point. From the responses I received, it sounds like many users
have come to depend on -p as the default. I will leave that behavior in
place, add the other two options, and document them in the man page and
on the www.clusterresources.com torque doc web site.

Ken Nielson
Adaptive Computing

