[torquedev] Fwd: [torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Glen Beane glen.beane at gmail.com
Thu Dec 10 17:42:11 MST 2009

Forwarded from the torqueusers list. Does anyone know why the default
behavior for pbs_mom now seems to be "pbs_mom -p"? This is not
desirable, but I think CRI made this change quite a while ago.

---------- Forwarded message ----------
From: Wendy Lin <hclin at lbl.gov>
Date: Thu, Dec 10, 2009 at 3:28 PM
Subject: Re: [torqueusers] pbs_mom request, was Re: PBS_MOM kills
running jobs when restarted
To: torqueusers <torqueusers at supercluster.org>

This is very good advice. I'd like to add one very serious side
effect of abusing -p.

We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
system. After a system-wide outage, when everything was started
afresh, I noticed that the jobs that had been active at the time of the
crash were all terminated, even though most of them were marked as
rerunnable. Further investigation showed that, because we started
pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for lost
children, did not find them, assumed that they had finished, and sent
obits to the server. Although the server had requeued the jobs when
it first started, it purged them in response to the obits.

So I totally agree with Glen that -p should not be the default.
Unfortunately, at least with the version of Torque we use, not only is
-p the default, but there is also no way (that I know of) to get back
the original default behavior, i.e. do nothing about previous jobs at
startup and leave it to the server to decide whether to purge or rerun
them. I tried the "-q" setting, but it did not do any better.
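
To make the sequence above concrete, here is a toy model in Python. It is
only a sketch of the behavior as described in this message, not Torque's
actual code: the Server class, mom_startup(), the
report_missing_as_finished flag, and the job ids and pids are all
hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for pbs_server: maps job id -> "running" or "queued"."""
    jobs: dict = field(default_factory=dict)

    def restart(self, rerunnable):
        # On startup, requeue jobs that were running and are marked rerunnable.
        for job_id, state in self.jobs.items():
            if state == "running" and job_id in rerunnable:
                self.jobs[job_id] = "queued"

    def receive_obit(self, job_id):
        # An obit says the job has ended, so the server purges it.
        self.jobs.pop(job_id, None)


def mom_startup(known_jobs, live_pids, server, report_missing_as_finished):
    """Toy stand-in for MOM startup after an outage.

    known_jobs: job id -> pid recorded before the outage
    live_pids:  pids that still exist after the restart
    """
    for job_id, pid in known_jobs.items():
        if pid in live_pids:
            continue  # process survived, nothing to report
        if report_missing_as_finished:
            # Behavior we observed: a missing process is assumed finished.
            server.receive_obit(job_id)
        # Otherwise stay silent and leave the decision to the server.


def simulate(report_missing_as_finished):
    server = Server(jobs={"101": "running", "102": "running"})
    server.restart(rerunnable={"101", "102"})
    # After a full system crash, none of the old pids exist any more.
    mom_startup({"101": 4001, "102": 4002}, set(), server,
                report_missing_as_finished)
    return server.jobs
```

In this model, simulate(True) ends with no jobs left on the server, which
matches what we saw, while simulate(False) leaves both jobs queued for
rerun, which is the behavior I would like to get back.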

I see that the latest Torque release is 2.4.2. Does its pbs_mom startup
behave in the same wrong way?

Wendy Lin
hclin at lbl.gov
