[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
Wendy Lin
hclin at lbl.gov
Thu Dec 10 13:28:35 MST 2009
On Jul 30, 2009, at 3:38 AM, Glen Beane wrote:
> you do not want -p as the default behavior when you reboot a node.
> pbs_mom could find pids that match its last known jobs and attempt to
> take ownership of them when in fact the pids have no relation to the
> previous jobs. -p should not be in any startup script, except for
> maybe a "restart" option
>
> -p should only be used in rare cases when you need to (re)start
> pbs_mom on a node already running jobs, _never_ at boot time, which is
> why it is not the default and why it is not in most sites startup
> scripts.
This is very good advice. I'd like to add one very serious side
effect of abusing -p.
We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
system. After a system-wide outage, when everything was started
afresh, I noticed that the jobs that had been active at the time of the
crash had all been terminated, even though most of them were marked as
rerunnable. Further investigation showed that, because we started
pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for lost
children, did not find them, assumed that they had finished, and sent
obits to the server. Although the server had requeued the jobs when
it first started, it purged them in response to the obits.
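
To make the sequence concrete, here is a toy model in Python. It is
only a sketch of the behavior as I understand it, not Torque code: the
Server class, mom_startup(), simulate(), and the job ids and pids are
all made up for illustration, and the "preserve=False" branch is my
assumption of what "leave it to the server" would look like.

```python
# Toy model of the failure sequence; NOT Torque source code.
# Every name here (Server, mom_startup, simulate) is hypothetical.

class Server:
    def __init__(self, running_jobs):
        # job id -> state, as remembered from before the outage
        self.jobs = {job_id: "running" for job_id in running_jobs}

    def startup(self):
        # On restart, requeue the rerunnable jobs that were running.
        for job_id in self.jobs:
            self.jobs[job_id] = "queued"

    def receive_obit(self, job_id):
        # An obit says the job has ended, so the server purges it.
        self.jobs.pop(job_id, None)


def mom_startup(server, saved_jobs, live_pids, preserve):
    """saved_jobs maps job id -> pid recorded before the outage."""
    if not preserve:
        # Assumed "leave it to the server" behavior: report nothing.
        return
    for job_id, pid in saved_jobs.items():
        if pid not in live_pids:
            # Task not found: assume it finished and send an obit.
            server.receive_obit(job_id)


def simulate(preserve):
    saved = {"101.sdb": 4242, "102.sdb": 4310}  # made-up ids and pids
    server = Server(saved)
    server.startup()  # jobs are requeued first
    # After a full outage none of the old pids exist any more.
    mom_startup(server, saved, live_pids=set(), preserve=preserve)
    return server.jobs
```

In this model simulate(True) leaves the server with no jobs at all,
while simulate(False) leaves both jobs queued for rerun.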
So I totally agree with Glen that -p should not be the default.
Unfortunately, at least in the version of Torque we use, not only is
-p the default, but there is also no way (that I know of) to get back
the original default behavior, i.e. do nothing about previous jobs
when pbs_mom starts and leave it to the server to decide whether to
purge or rerun them. I have tried the "-q" setting; it did not do any
better.
I see that the latest Torque release is 2.4.2. Does pbs_mom startup
behave in the same wrong way there?
--
Wendy Lin
hclin at lbl.gov