[Mauiusers] Re: [torqueusers] excessive emails

Dave Jackson jacksond at clusterresources.com
Fri Jul 1 08:41:23 MDT 2005


Garrick,

  This enhancement will address issues seen by a number of sites but you
are correct, it will not address all of them.  We are continuing work in
a number of areas including:

- Adding exponential backoff to Maui for job cancel attempts
- Adding an option to TORQUE to issue an 'asynchronous' cancel, which
will remove the job from pbs_server right away and, if the MOM ever
does report back to the server, cancel the job on the compute nodes.
From the time the asynchronous cancel is issued, pbs_server will no
longer report the job.
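
  As a rough illustration of the backoff idea in the first item, here is
a minimal sketch in Python.  It is not Maui code: the names
cancel_retry_delays, cancel_with_backoff and try_cancel, and the
base/factor/cap/max_attempts values, are all assumptions made up for the
example.  Each failed cancel attempt doubles the wait before the next
one, up to a ceiling, so an unresponsive MOM produces a handful of
retries rather than a steady stream of them.

```python
import time


def cancel_retry_delays(base=1.0, factor=2.0, cap=300.0):
    """Yield the wait in seconds before each successive retry.

    base, factor and cap are illustrative values, not Maui settings.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def cancel_with_backoff(try_cancel, max_attempts=8, sleep=time.sleep,
                        base=1.0, factor=2.0, cap=300.0):
    """Call try_cancel() until it returns True or attempts run out.

    try_cancel is a hypothetical callable standing in for "ask the
    resource manager to cancel the job".  Returns the attempt number
    that succeeded, or None if every attempt failed.
    """
    delays = cancel_retry_delays(base, factor, cap)
    for attempt, delay in zip(range(1, max_attempts + 1), delays):
        if try_cancel():
            return attempt
        if attempt < max_attempts:
            sleep(delay)  # wait longer after every failure
    return None
```

  Passing sleep as a parameter is only there so the schedule can be
exercised without actually waiting.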

  Also, while on a somewhat related subject, we are looking into
modifying pbs_mom to use the 'killpg()' system call when terminating a
job.  Currently, mom will issue a SIGTERM to the child process, wait a
couple of seconds, then issue a SIGKILL.  The killpg() call will send
those signals to the child process and to all other processes in the
child's process group.  We are currently using this
capability in some other non-TORQUE projects but would like to know
people's experiences with it.  Is it missing or broken on some
platforms?  Are there 'gotchas' associated with its use on certain
systems?  Are there other comparable projects already utilizing it?

  Please let us know.  If it works well, it may help decrease the
number of orphaned processes and possibly extend the usefulness of
TORQUE's suspend/resume based preemption.
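
  For anyone who wants to experiment, the SIGTERM-then-SIGKILL sequence
applied to a whole process group can be sketched in Python with
os.killpg(), which wraps the same system call.  This is only an
illustration on a POSIX system, not pbs_mom code: spawn_in_own_group,
terminate_process_group and the grace period are hypothetical names and
values, and the sketch assumes the job was started as the leader of its
own process group (here via a new session).

```python
import os
import signal
import subprocess
import sys
import time


def spawn_in_own_group(argv):
    """Start argv in a new session, so its pid is also its process group id."""
    return subprocess.Popen(argv, start_new_session=True)


def terminate_process_group(pgid, grace=2.0):
    """SIGTERM the whole group, wait `grace` seconds, then SIGKILL it.

    Returns False if the group was already gone at the first signal,
    True otherwise.  The grace value is an illustrative default.
    """
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return False  # no process left in that group
    time.sleep(grace)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # everything exited and was reaped during the grace period
    return True


if __name__ == "__main__":
    job = spawn_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    terminate_process_group(job.pid, grace=0.5)
    print("exit status:", job.wait())
```

  Note that a process which moves itself into a different process group
or session will not receive these signals, which is one of the cases
worth testing.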

Thanks,
Dave

On Thu, 2005-06-30 at 17:28 -0700, Garrick Staples wrote:
> On Thu, Jun 30, 2005 at 04:16:55PM -0600, Michael Musson alleged:
> > All,
> > 
> > Maui has been updated so that if a job is in the exiting state (E), Maui
> > will no longer try to cancel the job.  This should resolve the issue of
> > thousands of emails going out when Maui tries to kill a job that is
> > already exiting.  This change is present in the latest patch14 snapshot.
> 
> I don't think this is going to solve the problem.  When I've seen these loops
> it's because the mom superior isn't responding.  In that case, the job is
> never put into E state because the mom isn't around to tell pbs_server that the
> job is exiting.
> 
> 
> > Mike M.
> > 
> > On Mon, 2005-06-27 at 09:45 -0700, Garrick Staples wrote:
> > > On Mon, Jun 27, 2005 at 12:50:35PM +0200, Roy Dragseth alleged:
> > > > On Monday 27 June 2005 08:02, Garrick Staples wrote:
> > > > > This is already filed in bugzilla #61.  The general idea is that maui is
> > > > > telling pbs_server to kill a job, but for whatever reason pbs_mom isn't
> > > > > doing it.  The problem is that users are getting an email each time;
> > > > > possibly hundreds of emails.
> > > > >
> > > > > Does anyone have any good ideas on how pbs_server can be smarter about
> > > > > this?
> > > > >
> > > > > > I'm thinking that a generalized mail rate limiter can be added with a new
> > > > > "minimum time between emails per job" server attribute.  pbs_server could
> > > > > record the timestamp of the last email sent in a new job attribute and
> > > > > refuse to send emails if enough time hasn't elapsed yet.  It is a simple,
> > > > > easily understood mechanism that is trivially coded, but could easily
> > > > > discard useful email.
> > > > >
> > > > > Maybe we could also record the last "reason" and take that into account. 
> > > > > Maybe we could keep counters for each type of email.  I don't know.
> > > > >
> > > > > Anyone else have any ideas?
> > > > 
> > > > I really like the way maui handles this by letting one specify a notification 
> > > > program that takes care of handling the report.   Then I can customize who 
> > > > gets what messages and so on.
> > > 
> > > That's certainly something that could be done.  Do you have any scripts that
> > > ratelimit messages?  Maybe we could adapt
> > > http://dcs.nac.uci.edu/~strombrg/rate-limit.html
> > > 
> > > 
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > 
> > _______________________________________________
> > mauiusers mailing list
> > mauiusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/mauiusers
> 


