[torqueusers] Server does not detect node state change for job initiated on that node

Garrick Staples garrick at usc.edu
Thu Feb 17 22:02:33 MST 2005


On Thu, Feb 17, 2005 at 08:01:55PM -0700, David Osguthorpe alleged:
> On Fri, Feb 18, 2005 at 09:02:54AM +1100, Chris Samuel wrote:
> > 
> > > If it was just the MOM that had died, you probably wouldn't want the job
> > > deleted because it's probably running quite happily.  So the current
> > > behaviour is OK in that case.
> > 
> > Agreed, this also matters in the case of something like an ethernet network 
> > failure on a cluster with some other interconnect, where the parallel jobs 
> > could carry on quite happily whilst the pbs_server is unable to talk to the 
> > mom.  No point marking the job dead until you can say for certain it's gone.
> > 
> 
> The question is which is more likely: do you lose more work by killing jobs
> that may still be working under certain MOM/node fault conditions (particularly
> faults on the primary mother superior MOM), or in the current situation, where
> jobs really are not working even though PBS thinks they are (which means all
> the job's nodes are really idle) and there is no notification that there is a
> problem?  So far I've only seen the second case - and note that, as far as I
> can see, the job would have remained in the system forever, because the
> walltime was not being updated on the server, which would have locked those
> nodes out forever.

We all know openpbs/torque has a history of getting jobs into inconsistent
states, and cleaning up afterwards is often difficult.  I think we can attack
this problem in three ways:

 - Work on torque to prevent jobs from getting into inconsistent states.  I've
   been working a lot in this area and my hope is that 1.2.0p1 goes a long way
   towards helping this problem.

 - Work on torque to make the messy cleanups easier.  I personally don't want
   to spend a lot of time here, because the point above makes this one less
   pressing; I'd certainly rather just prevent the errors in the first place.

 - Come up with monitoring facilities to grab the admin's attention.  This can
   be as simple as some cron scripts.  A good place to start might be
   my (shameless plug) perl module in the torque contrib directory.  Let's keep
   this functionality outside of pbs_server as much as possible.
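
To make that concrete, here's a minimal sketch of the kind of cron check I
mean (this is an illustration, not the contrib perl module; the `pbsnodes -a`
output format shown in the sample is an assumption):

```shell
#!/bin/sh
# parse_down_nodes reads `pbsnodes -a`-style output on stdin and prints the
# names of nodes whose "state =" line mentions down or offline.
parse_down_nodes() {
    awk '
        /^[^ ]/                     { node = $1 }   # unindented line: node name
        /state =/ && /down|offline/ { print node }  # report unhealthy states
    '
}

# In a real cron job this would be something like:
#   pbsnodes -a | parse_down_nodes | mail -s "down nodes" admin
# Here we feed a small fabricated sample to show the behaviour.
printf 'node01\n     state = free\nnode02\n     state = down,offline\n' \
    | parse_down_nodes    # prints "node02"
```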

An added note on the third point: we now have the node health check script in
pbs_mom.  For any condition you want, you can have pbs_mom stick an ERROR
message into the node's attributes, which is trivially retrieved by any PBS
client.
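
For illustration, a health check hook might look like the sketch below.  The
checked threshold, the filesystem, and the helper name are my own inventions;
the real point is just that any output line beginning with "ERROR" is what
ends up in the node's attributes:

```shell
#!/bin/sh
# Sketch of a pbs_mom health check script (wired up via the mom config's
# health check directive; thresholds and paths here are illustrative).

error_if_full() {
    # $1 = percent used, $2 = threshold; emit an ERROR line when over it.
    [ "$1" -gt "$2" ] && echo "ERROR filesystem ${1}% full"
    return 0
}

# Real usage would feed in live numbers, e.g.:
#   pct=$(df -P /var/spool/torque | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
#   error_if_full "$pct" 95
error_if_full 97 95   # prints "ERROR filesystem 97% full"
error_if_full 50 95   # prints nothing
```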

A useful extension of the node health check feature is in the first attachment
of http://clusterresources.com/bugzilla/show_bug.cgi?id=38, which makes
pbs_server mark a node "down" if it has an ERROR message.  If a node is marked
"down" in the middle of a job, the job will continue running but won't *exit*
until the problem is cleared.  IMHO this is the most desirable behaviour
overall: pbs_server detects and makes visible any variety of node errors, and
leaves the final fate of the job to the admin.  If the job hasn't already
exited, this ensures the user has the highest possible chance of getting any
output.

Another possible extension of node health checks would be to support a WARNING
message, which gets into the node attributes but is otherwise ignored.

With the node health check feature, a separate monitoring program (or Moab) is
used to actually inform the admin of ERROR or WARNING messages.
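
Such a watcher could be as simple as something like this sketch (the
"message =" attribute name in the parsed output is an assumption about the
format; adjust to whatever your pbsnodes actually prints):

```shell
#!/bin/sh
# Scan `pbsnodes -a`-style output for ERROR or WARNING messages left by the
# health check, and report them per node.
report_node_messages() {
    awk '
        /^[^ ]/ { node = $1 }                 # unindented line: node name
        /message = (ERROR|WARNING)/ {
            sub(/^ *message = /, "")          # strip the attribute prefix
            print node ": " $0
        }
    '
}

# Cron usage might be:
#   pbsnodes -a | report_node_messages | mail -s "node problems" admin
printf 'node03\n     message = ERROR local disk failure\n' \
    | report_node_messages    # prints "node03: ERROR local disk failure"
```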


> at the minimum, under the situation I had where the server knew it had lost
> contact with the primary mother superior node/MOM, the status in qstat
> should change from R to something else, e.g. ? or U (unknown/undetermined) or
> I (indeterminate).  Another option would be to e-mail the admin user if
> contact is lost to a primary mother superior node/MOM (but not for contact
> lost to other slave nodes).
> 
> maybe this should be a configurable option: for the server to be able to
> delete and remove jobs if the primary mother superior MOM is not contactable.
> It seems under the current torque system all job cleanup etc. is delegated to
> the primary mother superior MOM, so there are multiple problems if the server
> loses contact with the primary MOM/node, e.g. the infinite e-mails to the
> user as the PBS server tries to delete the job but never can, because the
> primary MOM is not there when the job exceeds its walltime.
 
I'm not sure it could ever be considered safe to kill a job without talking
to the mother superior (MS).


> - the server should probably allow the execution of an "epilogue" script similar
> to what the MOM would do

How about pbs_server just attempting to restart pbs_mom to regain
communication?  You could also use the "fencing" techniques employed by some
HA fileserver implementations (remote power control to "shoot your mom in the
head") (hehe, some FBI system just woke up and is investigating me now because
of that statement).
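
As a sketch of that idea, run from an external script rather than from
pbs_server (everything here is an assumption: ssh access to nodes, the init
script path, and the fencing step; REMOTE is overridable so the decision
logic can be exercised without a cluster):

```shell
#!/bin/sh
# Try to restart pbs_mom on a node; fall back to fencing if that fails.
recover_mom() {
    node=$1
    remote=${REMOTE:-ssh}
    if $remote "$node" pgrep pbs_mom >/dev/null 2>&1; then
        echo "pbs_mom alive on $node"
    elif $remote "$node" /etc/init.d/pbs_mom start >/dev/null 2>&1; then
        echo "restarted pbs_mom on $node"
    else
        # Last resort: remote power control ("shoot the MOM in the head").
        echo "fencing $node: would trigger a remote power reset here"
    fi
}
```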

All of this would be tricky and is probably best done by a separate monitoring
system.  I don't know about the rest of you, but I don't want pbs_server
sending out any more emails than necessary.  Nagios, anyone?


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California