[torquedev] Default job recovery behavior for pbs_mom

Ken Nielson knielson at adaptivecomputing.com
Fri Jan 15 14:57:47 MST 2010


Hi all,

Forgive me for spending so much bandwidth on this problem. However, I 
have a final story on what is to be the default behavior for the MOM 
when it is initialized after going down with running jobs. I'm sorry to 
say there is something in this for everyone to dislike. But bugs 
have been fixed and the behavior is now documented, so hopefully any 
inconvenience will be short-lived.

To make sense of the explanation below I need to define two terms: 
terminate and kill.

When the MOM terminates a job, it deletes its record of the job and 
informs the batch server. Terminate does not apply to a running 
process. Kill describes how a running process is ended; it applies to 
the process, not the job.

In 2.3.x and earlier, when pbs_mom initializes it terminates, by 
default, any previously running jobs and informs the batch server. 
Re-runnable jobs are re-queued by the batch server. The MOM is not 
supposed to try to kill any running job processes. Before the fix I 
checked in today, the MOM would terminate the job and also kill any 
running processes by default.

In 2.4.x and beyond the default behavior changed to that of the -p 
option, which tries to preserve jobs. That is, when the MOM 
reinitializes it looks for processes with the same pid as any of the 
recovered jobs and then tracks those jobs, assuming they are the same 
running processes that existed before the MOM shut down.
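
To illustrate the pid-matching idea only, here is a minimal Python 
sketch. It is not the pbs_mom code (which is C); the function names and 
the job-id-to-pid mapping are hypothetical. It assumes a POSIX system 
and uses signal 0 to probe whether a pid exists.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if some process with this pid currently exists (POSIX)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def match_recovered_jobs(recovered: dict[str, int]) -> dict[str, bool]:
    """Map each recovered job id (hypothetical key) to whether its recorded pid is alive."""
    return {job_id: pid_is_running(pid) for job_id, pid in recovered.items()}
```

Note that a matching pid only shows that some process holds that pid 
now. It does not prove it is the original job process, which is why the 
behavior above is described as an assumption.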

The -q flag was added to 2.4.x to allow users to reproduce the 2.3.x 
default functionality.

The -r flag terminates all jobs that were running when the mom shut down 
and then kills any running processes with a pid that matches the pid of 
the recovered jobs. Re-runnable jobs are re-queued by the batch server.

I added a -P (cap P) option to 2.4.x which is similar to the -p option. 
The difference is that the -P option terminates all jobs and does not 
try to recover running processes.
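
As a reading aid, the options described in this message can be 
summarized in a small Python table. This is only a sketch of the 
behavior as stated here; the class and field names are made up for 
illustration, and where this message does not say what an option does, 
the field is left as None rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryMode:
    terminates_jobs: bool             # MOM deletes its job record and informs the server
    kills_processes: Optional[bool]   # None means this message does not say
    tracks_processes: bool            # MOM keeps following processes with matching pids


# Behavior of each pbs_mom option in 2.4.x, as described in this message.
MODES = {
    "-p": RecoveryMode(terminates_jobs=False, kills_processes=False, tracks_processes=True),
    "-q": RecoveryMode(terminates_jobs=True, kills_processes=False, tracks_processes=False),
    "-r": RecoveryMode(terminates_jobs=True, kills_processes=True, tracks_processes=False),
    "-P": RecoveryMode(terminates_jobs=True, kills_processes=None, tracks_processes=False),
}

# Default behavior per release series. 2.3.x has no -q flag; its default
# simply behaves the way -q does in 2.4.x.
DEFAULT_MODE = {"2.3.x": MODES["-q"], "2.4.x": MODES["-p"]}
```

For example, DEFAULT_MODE["2.4.x"].tracks_processes is True, while 
DEFAULT_MODE["2.3.x"].kills_processes is False.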

I have updated the pbs_mom man pages for both 2.3 and 2.4 versions.

I have created a snapshot for 2.3 and one for 2.4. They can be found at 
the following:

http://www.clusterresources.com/downloads/torque/snapshots/torque-2.3.10-snap.201001151340.tar.gz 


http://www.clusterresources.com/downloads/torque/snapshots/torque-2.4.5-snap.201001151416.tar.gz 

Please feel free to download these and try them out. Any feedback is 
welcome.

Regards

Ken Nielson
Adaptive Computing

