[torquedev] pbs_mom -p and rerunnable jobs

Ken Nielson knielson at adaptivecomputing.com
Thu Jan 7 16:15:00 MST 2010


Hi all,

I am resurrecting an old thread because I have received the following from Cray.

On Jul 30, 2009, at 3:38 AM, Glen Beane wrote:

 >you do not want -p as the default behavior when you reboot a node.
 >pbs_mom could find pids that match its last known jobs and attempt to
 >take ownership of them when in fact the pids have no relation to the
 >previous jobs. -p should not be in any startup script, except for
 >maybe a "restart" option

 >-p should only be used in rare cases when you need to (re)start
 >pbs_mom on a node already running jobs, _never_ at boot time, which is
 >why it is not the default and why it is not in most sites' startup
 >scripts.
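
To make the PID-reuse hazard concrete, here is a minimal sketch in Python. It is not pbs_mom code. The ProcInfo record, the recorded start time, and both function names are assumptions made only for illustration. It contrasts matching on PID alone with matching on PID plus start time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical record of a process: its PID and absolute start time."""
    pid: int
    start_time: int  # assumed: seconds since the epoch when the process started


def naive_matches(recorded, live):
    """Match on PID alone: any live process with a recorded PID is 'adopted'."""
    live_pids = {p.pid for p in live}
    return [r for r in recorded if r.pid in live_pids]


def checked_matches(recorded, live):
    """Match only when both the PID and the start time agree."""
    live_by_pid = {p.pid: p for p in live}
    return [r for r in recorded if live_by_pid.get(r.pid) == r]


# Tasks recorded before the reboot.
recorded = [ProcInfo(4211, 1_262_900_000), ProcInfo(4212, 1_262_900_050)]
# After the reboot an unrelated process happens to get PID 4211.
live = [ProcInfo(4211, 1_262_905_000)]

wrongly_adopted = naive_matches(recorded, live)
safely_adopted = checked_matches(recorded, live)
```

With these inputs the PID-only match picks up the unrelated process, while the stricter match adopts nothing.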

Cray wrote:
 >This is very good advice. I'd like to add one very serious side
 >effect of abusing -p.

 >We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
 >system. After a system-wide outage, when everything was started
 >afresh, I noticed that the jobs that had been active at the time of
 >the crash were all terminated, even though most of them were marked
 >as rerunnable. Further investigation indicated that, because we
 >started pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for
 >lost children, did not find them, assumed that they had finished, and
 >sent obits to the server. Although the server had requeued the jobs
 >when it first started, it purged them in response to the obits.
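
The sequence described above can be sketched as a toy model. This is not Torque code. The Server class, the State values, and mom_startup() are hypothetical names, and the model only covers rerunnable jobs. It simply replays the reported order of events: the server requeues on restart, MOM scans and finds nothing, MOM sends obits, and the server purges.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    QUEUED = "queued"
    PURGED = "purged"


class Server:
    """Toy server that tracks the state of rerunnable jobs only."""

    def __init__(self, job_ids):
        self.state = {job_id: State.RUNNING for job_id in job_ids}

    def restart(self):
        # On startup the server requeues the rerunnable jobs it knows about.
        for job_id in self.state:
            self.state[job_id] = State.QUEUED

    def receive_obit(self, job_id):
        # An obit tells the server the job is finished, so it purges the job.
        self.state[job_id] = State.PURGED


def mom_startup(known_jobs, live_jobs, recover, server):
    """Toy MOM startup: with recover=True, report every missing job as done."""
    if not recover:
        return []  # leave the decision to the server
    lost = [job_id for job_id in known_jobs if job_id not in live_jobs]
    for job_id in lost:
        server.receive_obit(job_id)
    return lost


jobs = ["101", "102"]

with_recovery = Server(jobs)
with_recovery.restart()
mom_startup(jobs, live_jobs=set(), recover=True, server=with_recovery)

without_recovery = Server(jobs)
without_recovery.restart()
mom_startup(jobs, live_jobs=set(), recover=False, server=without_recovery)
```

In this model the jobs end up purged when recovery is attempted after a full outage, and stay queued when MOM leaves them alone.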

 >So I totally agree with Glen that -p should not be the default.
 >Unfortunately, at least with the version of Torque we use, not only
 >is -p the default, but there is also no way (that I know of) to get
 >back the original default behavior, i.e. do nothing about previous
 >jobs at startup and leave it to the server to decide whether to purge
 >or rerun them. I have tried the "-q" setting; it did not do any better.

 >I saw that the latest Torque release is 2.4.2. Does its pbs_mom
 >startup behave in the same incorrect way?
----

 >In summary, the real issue is that not only is -p the default, but
 >there is also no way to get the original default behavior, which
 >allows rerunnable jobs to rerun.

 >My posting has been forwarded to the torquedev mailing list, but I've
 >not heard from them yet.

So who gets broken if -p is no longer the default? For Cray there is no
workaround. Unfortunately, there is no "!-p" option; maybe adding one is
the solution. I would prefer that pbs_mom not try to recover jobs unless
explicitly requested by a command-line parameter at startup.
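
As a rough illustration of the opt-in behavior I have in mind, here is a toy parser in Python. It is not pbs_mom's actual option handling. The program name toy_mom and the mode strings are made up for the example. The only point is that recovery happens when -p is given and not otherwise.

```python
import argparse


def parse_startup_args(argv):
    """Parse a toy command line where job recovery is opt-in via -p."""
    parser = argparse.ArgumentParser(prog="toy_mom")
    parser.add_argument(
        "-p",
        dest="recover",
        action="store_true",  # defaults to False when -p is absent
        help="attempt to recover jobs that were running before the restart",
    )
    return parser.parse_args(argv)


def startup_mode(argv):
    """Return the (hypothetical) startup mode chosen by the arguments."""
    args = parse_startup_args(argv)
    return "recover" if args.recover else "defer-to-server"


default_mode = startup_mode([])
explicit_mode = startup_mode(["-p"])
```

With no arguments the toy daemon defers to the server, which is the behavior Cray is asking for after a full outage.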

What are your thoughts?

Thanks

Ken Nielson
Adaptive Computing



