[torquedev] Fwd: [torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
glen.beane at gmail.com
Thu Dec 10 17:42:11 MST 2009
Forwarded from the torqueusers list. Does anyone know why the default
behavior for pbs_mom now seems to be "pbs_mom -p"? This is not
desirable, but I think CRI made this change quite a while ago.
---------- Forwarded message ----------
From: Wendy Lin <hclin at lbl.gov>
Date: Thu, Dec 10, 2009 at 3:28 PM
Subject: Re: [torqueusers] pbs_mom request, was Re: PBS_MOM kills
running jobs when restarted
To: torqueusers <torqueusers at supercluster.org>
This is very good advice. I'd like to add one very serious side
effect of abusing -p.
We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
system. After a system-wide outage, when everything was started
afresh, I noticed that the jobs that had been active at the time of
the crash had all been terminated, even though most of them were
marked as rerunable. Further investigation indicated that, because we
started pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for
lost children, did not find them, assumed that they had finished, and
sent obits to the server. Although the server had requeued the jobs
when it first started, it purged them in response to the obits.
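The sequence described above can be sketched as a toy model. This is
not Torque code: Server, requeue_on_startup, receive_obit and
mom_startup_preserve are hypothetical names chosen for illustration,
and the real logic in pbs_server and pbs_mom is certainly more
involved. It only shows how a requeue followed by an obit for a
missing process ends with the job purged.

```python
# Toy model only: none of these names are Torque APIs.
from dataclasses import dataclass, field


@dataclass
class Server:
    jobs: dict = field(default_factory=dict)  # job id -> state

    def requeue_on_startup(self):
        # Jobs that were running at the crash go back to the queue.
        for job_id, state in self.jobs.items():
            if state == "running":
                self.jobs[job_id] = "queued"

    def receive_obit(self, job_id):
        # An obit says "this job has ended", so the server purges it.
        self.jobs.pop(job_id, None)


def mom_startup_preserve(known_jobs, live_pids, server):
    """Mimic the "-p" behavior described above: a job whose process
    cannot be found is assumed finished and an obit is sent."""
    for job_id, pid in known_jobs.items():
        if pid not in live_pids:
            server.receive_obit(job_id)


server = Server(jobs={"101": "running", "102": "running"})
server.requeue_on_startup()  # both jobs are now "queued"
# After a full outage no old process survives, so live_pids is empty.
mom_startup_preserve({"101": 4242, "102": 4243}, live_pids=set(), server=server)
print(server.jobs)  # {}
```

In this model the requeued jobs survive only if the MOM stays silent
about processes it cannot find, which is the original default behavior
asked for below.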
So I totally agree with Glen that -p should not be the default.
Unfortunately, at least with the version of Torque we use, not only is
-p the default, but there is also no way (that I know of) to get back
the original default behavior, i.e. do nothing about previous jobs at
startup and leave it to the server to decide whether to purge or rerun
them. I have tried the "-q" setting; it did no better.
I saw that the latest Torque release is 2.4.2. Does pbs_mom startup
behave the same wrong way there?
hclin at lbl.gov