[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Wendy Lin hclin at lbl.gov
Thu Dec 10 13:28:35 MST 2009


On Jul 30, 2009, at 3:38 AM, Glen Beane wrote:

> you do not want -p as the default behavior when you reboot a node.
> pbs_mom could find pids that match its last known jobs and attempt to
> take ownership of them when in fact the pids have no relation to the
> previous jobs.  -p should not be in any startup script, except for
> maybe a "restart" option
>
> -p should only be used in rare cases when you need to (re)start
> pbs_mom on a node already running jobs, _never_ at boot time, which is
> why it is not the default and why it is not in most sites startup
> scripts.


This is very good advice. I'd like to add one very serious side
effect of abusing -p.

We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
system. After a system-wide outage, when everything was started
afresh, I noticed that the jobs that had been active at the time of
the crash were all terminated, even though most of them were marked as
rerunnable. Further investigation showed that, because we started
pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for lost
children, did not find them, assumed that they had finished, and sent
obits to the server. Although the server had requeued the jobs when
it first started, it purged them in response to the obits.
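
To make the sequence concrete, here is a toy model in Python. It is
only a sketch of the behavior described above, not Torque's code: the
function names, job IDs, and PIDs are all made up for illustration.

```python
# Toy model only: these names are hypothetical and do not mirror Torque's code.

def mom_startup_scan(recorded_jobs, live_pids):
    """Return IDs of jobs for which none of the recorded PIDs still exist.

    recorded_jobs: dict of job ID -> set of PIDs saved before the outage.
    live_pids: set of PIDs present on the node now.
    """
    return [job_id for job_id, pids in recorded_jobs.items()
            if not (pids & live_pids)]


def server_handle_obits(queue, obits):
    """Drop every job an obit was received for, whatever its queue state."""
    return {job_id: state for job_id, state in queue.items()
            if job_id not in obits}


# After a full reboot, none of the old PIDs exist any more.
recorded_jobs = {"101.sdb": {4242, 4243}, "102.sdb": {5150}}
live_pids = {1, 2, 300}

# The server requeued both rerunnable jobs when it came back up.
queue = {"101.sdb": "Q", "102.sdb": "Q"}

# MOM finds no matching processes, assumes the jobs ended, and reports them.
obits = mom_startup_scan(recorded_jobs, live_pids)
queue = server_handle_obits(queue, obits)

print(obits)  # both jobs are reported as finished
print(queue)  # the requeued jobs are gone
```

In the same model, if an unrelated process happened to reuse PID 4242
after the reboot, job 101.sdb would be treated as still running, which
is the risk Glen describes.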

So I totally agree with Glen that -p should not be the default.
Unfortunately, at least with the version of Torque we use, not only is
-p the default, but there is also no way (that I know of) to get back
the original default behavior, i.e. do nothing about previous jobs at
startup and leave it to the server to decide whether to purge or rerun
them. I have tried the "-q" setting; it did no better.

I see that the latest Torque release is 2.4.2. Does pbs_mom startup
behave in the same incorrect way there?

-- 
Wendy Lin
hclin at lbl.gov





