[torquedev] Fwd: [torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Garrick Staples garrick at usc.edu
Thu Dec 10 18:00:18 MST 2009


On Thu, Dec 10, 2009 at 07:42:11PM -0500, Glen Beane alleged:
> forwarded from the torqueusers list.  Does anyone know why the default
> behavior for pbs_mom now seems to be "pbs_mom -p"?  This is not
> desirable, but I think CRI made this change quite a while ago.

September 2008.

$ svn log -r2443
------------------------------------------------------------------------
r2443 | josh | 2008-09-15 08:58:54 -0700 (Mon, 15 Sep 2008) | 3 lines

FEATURE:        made the behavior of pbs_mom -p now the default when starting pbs_mom--also added a new "-q" option to pbs_mom which will override the new default behavior



I seem to recall arguing against this change, but I don't see anything in the
torqueusers or torquedev archives around that time.

IIRC, the purpose was to allow people to 'restart' pbs_mom without remembering
to use -p.

 
 
> ---------- Forwarded message ----------
> From: Wendy Lin <hclin at lbl.gov>
> Date: Thu, Dec 10, 2009 at 3:28 PM
> Subject: Re: [torqueusers] pbs_mom request, was Re: PBS_MOM kills
> running jobs when restarted
> To: torqueusers <torqueusers at supercluster.org>
> 
> 
> This is very good advice. I'd like to add one very serious side
> effect of abusing -p.
> 
> We are running Torque 2.4.1b1-snap.200905131530 on a very large Cray
> system. After a system-wide outage, when everything was started
> afresh, I noticed that the jobs that had been active at the time of
> the crash were all terminated, even though most of them were marked
> as rerunnable. Further investigation indicated that, since we started
> pbs_mom with "-p", MOM ran scan_non_child_tasks() to look for lost
> children, did not find them, assumed that they had finished, and sent
> obits to the server. Although the server had requeued the jobs when
> it first started, it purged them in response to the obits.
> 
> So I totally agree with Glen that -p should not be the default.
> Unfortunately, at least with the version of Torque we use, not only
> is -p the default, but there is also no way (that I know of) to get
> back the original default behavior, i.e. do nothing about previous
> jobs at startup and leave it to the server to decide whether to purge
> or rerun them. I have tried the "-q" setting; it did no better.
> 
> I saw that the latest Torque release is 2.4.2. Does pbs_mom startup
> behave the same (wrong) way there?
> 
> --
> Wendy Lin
> hclin at lbl.gov
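
For anyone trying to follow the sequence Wendy describes above, here is a
toy model of it in Python. It is only a sketch of the reported behavior, not
TORQUE source: every function and field name below is hypothetical, and the
startup scan is a stand-in for what scan_non_child_tasks() is reported to
do, not a reading of its actual code.

```python
# Toy model of the reported failure mode. NOT TORQUE code: all names
# here are hypothetical and only mirror the description in the email.

def server_recover(jobs):
    """On startup, the server requeues rerunnable jobs that were running."""
    for job in jobs.values():
        if job["state"] == "R" and job["rerunnable"]:
            job["state"] = "Q"


def mom_startup_scan(mom_jobs, live_pids):
    """Model a '-p'-style startup: a job whose processes cannot be found
    is assumed to have finished, so an obit is generated for it."""
    obits = []
    for job_id, pids in mom_jobs.items():
        if not any(pid in live_pids for pid in pids):
            obits.append(job_id)
    return obits


def server_handle_obits(jobs, obits):
    """The server treats an obit as 'job is done' and purges the job."""
    for job_id in obits:
        jobs.pop(job_id, None)


def simulate_outage():
    """Full system outage: no job process survives the reboot."""
    jobs = {
        "101": {"state": "R", "rerunnable": True},
        "102": {"state": "R", "rerunnable": True},
    }
    mom_jobs = {"101": [4001, 4002], "102": [4100]}  # MOM's saved job records
    live_pids = set()                                # nothing survived

    server_recover(jobs)
    requeued = sorted(j for j, v in jobs.items() if v["state"] == "Q")

    obits = sorted(mom_startup_scan(mom_jobs, live_pids))
    server_handle_obits(jobs, obits)
    return requeued, obits, jobs
```

In this model both jobs are requeued by the server and then purged again
once the obits arrive, which matches what was observed. If a job's processes
are still alive (the ordinary "restart pbs_mom on a healthy node" case that
-p was meant for), the scan produces no obit for that job.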

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!