[torquedev] pbs_mom -p and rerunnable jobs

Bas van der Vlies basv at sara.nl
Fri Jan 8 02:06:01 MST 2010


On 8 jan 2010, at 00:35, Martin Siegert wrote:

> Hi Ken,
> 
> On Thu, Jan 07, 2010 at 04:15:00PM -0700, Ken Nielson wrote:
>> Hi all,
>> 
>> I am resurrecting a thread. I have received the following from cray.
>> 
>> On Jul 30, 2009, at 3:38 AM, Glen Beane wrote:
>> 
>>> you do not want -p as the default behavior when you reboot a node.
>>> pbs_mom could find pids that match its last known jobs and attempt to
>>> take ownership of them when in fact the pids have no relation to the
>>> previous jobs. -p should not be in any startup script, except for
>>> maybe a "restart" option
>> 
>>> -p should only be used in rare cases when you need to (re)start
>>> pbs_mom on a node already running jobs, _never_ at boot time, which is
>>> why it is not the default and why it is not in most sites startup
>>> scripts.
>> 
>> Cray wrote:
>>> This is very good advice. I'd like to add one very serious side
>>> effect of abusing -p.
>> 
>>> We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
>>> system. After a system wide outage, when everything got started
>>> afresh, I noticed the jobs that had been active at the time of crash
>>> all got terminated, even though most of these jobs were marked as
>>> rerunable. Further investigation indicated since we started pbs_mom
>>> with "-p", MOM ran the scan_non_child_tasks() to look for lost
>>> children, did not find them, assumed that they finished, and sent
>>> obit's to the server. Although the server had requeued the jobs when
>>> it first started, it purged them in response to the obit's.
>> 
>>> So I totally agree with Glen that -p should not be the default.
>>> Unfortunately, at least with the version of Torque we use, not only -p
>>> is the default but also there is no way (that I know of) to get back
>>> the original default behavior, i.e. don't do anything about previous
>>> jobs when it starts, leave it to the server to decide whether to purge
>>> or rerun them. I have tried the "-q" setting, it did not do any better.
>> 
>>> I saw the latest Torque release is 2.4.2. Does the pbs_mom startup act
>>> the same wrong way?
>> ----
>> 
>>> In summary, the real issue is that not only is -p the default, but also
>>> there is no way to get the original default behavior, which allows
>>> rerunable jobs to rerun.
>> 
>>> My posting has been forwarded to the torquedev mailing list, but I've
>>> not heard from them yet.
>> 
>> So who gets broken if -p is no longer the default? For Cray there is no
>> workaround. Unfortunately there is no !-p option. Maybe this is the
>> solution. I would prefer pbs_mom not to try and recover jobs unless
>> explicitly requested in a command line parameter at startup.
>> 
>> What are your thoughts?
> 
> We have the -p in our init scripts, i.e., we always start/restart
> the pbs_mom with -p. The point is that we restart pbs_mom much
> more often than we reboot nodes and not having -p would have
> disastrous effects. E.g., just recently we upgraded the moms
> in order to use the ignmem feature. But since we have -p in the
> init script we basically don't care whether it is the default or not.
> 
> However, we would like restartable jobs to work regardless of whether
> -p is set or not. I.e., the server should not remove requeued jobs
> that are marked as restartable.
> 
> 

We have the following setup. Our init.d script has separate options for starting and restarting pbs_mom. When we start pbs_mom we leave out the -p option; this is the path used when a node boots or reboots. When we restart pbs_mom on a running node, we use the -p option.
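
A minimal sketch of that split, for illustration only. This is not our actual script: the pbs_mom path, the DRY_RUN switch, and the helper names mom_opts, run, stop_mom and main are assumptions made up for this example, and stop_mom is left as a stub because how you shut down the running daemon is site-specific. The only part taken from the setup above is that "start" passes no -p and "restart" passes -p.

```shell
#!/bin/sh
# Illustrative init.d-style wrapper; the path below is an assumption.
PBS_MOM="${PBS_MOM:-/usr/sbin/pbs_mom}"

# Print the pbs_mom options for the given action.
mom_opts() {
    case "$1" in
        start)   : ;;          # boot/reboot: no -p, old pids are meaningless
        restart) echo "-p" ;;  # daemon restart on a live node: keep running jobs
        *)       return 1 ;;
    esac
}

# With DRY_RUN=1, print the command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical placeholder: stop the running pbs_mom in a site-specific way.
stop_mom() {
    :
}

main() {
    case "$1" in
        start)
            run "$PBS_MOM" $(mom_opts start)
            ;;
        restart)
            stop_mom
            run "$PBS_MOM" $(mom_opts restart)
            ;;
        *)
            echo "Usage: $0 {start|restart}" >&2
            return 2
            ;;
    esac
}

if [ $# -gt 0 ]; then
    main "$@"
fi
```

Running it with DRY_RUN=1 and the argument "restart" only prints the command line that would be executed, which is a cheap way to check which flags each action ends up with before putting it on a node.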

regards

--
Bas van der Vlies
basv at sara.nl




