[torqueusers] Removing the "exec_host" attribute from a queued job ?

Dave Jackson jacksond at clusterresources.com
Mon Sep 26 09:35:38 MDT 2005


Richard,

  We have found an alternate source of the 15041 failure associated with
pbs_mom restart using the '-p' flag.  It appears that stale jobs in
certain states can cause a node to stop accepting jobs after a time.  We
have added a patch to address this issue in the latest patch 7 snapshot.

  If anyone is seeing 15041 errors persistently reported by a particular
node, whether or not the '-p' flag is being used, please let us know.

Dave
  

On Tue, 2005-09-20 at 09:49 -0500, Richard Walsh wrote:
> Wolfgang Wander wrote:
> 
> >Simon Robbins writes:
> > > 
> > > Hello,
> > > 
> > > On Tue, 20 Sep 2005, Chris Samuel wrote:
> > > 
> > > > Hi folks,
> > > > 
> > > > I've got a job that's queued and obviously tried to start and failed and has 
> > > > ended up with the following attribute set on it:
> > > > 
> > > >    exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> > > > 
> > > > I suspect it's stopping Moab or Torque from running it again on other nodes, 
> > > > and I'd like to clear that attribute, but it doesn't appear to be accessible 
> > > > through qalter or qmgr.
> > > > 
> > > > Any clues ?
> > > 
> > > Unfortunately no.  I have been seeing this behaviour 
> > > for months now with torque_1.2.0p2,4,5 and 6.  From 
> > > Maui I get:
> > > HostList:
> > >   [n504:1]
> > > Messages:  cannot start job - RM failure, rc: 15041, msg: 
> > > 'Execution server rejected request MSG=send failed, STARTING'
> > > 
> > > Sometimes this is associated with a failure in the 
> > > network.
> > > 
> >
> >I've noticed that you can qrun -H [free-node] jobid the job.
> >You'll have to find a [free-node] manually though to make this
> >work...
> >
> >           Wolfgang
> >
> >_______________________________________________
> >torqueusers mailing list
> >torqueusers at supercluster.org
> >http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >  
> >
> All,
> 
> I had sent a note about this also to the Maui list.  Having read these
> responses, I find that, in my case at least, just a qrun by itself will
> not get the job started.  You have to run a releasehold (a maui command) and
> then a runjob (also a maui command, either -c or -x seems to work). 
> At that point the job seems to be allocated new processors and runs.
> 
> Perhaps this sequence will work in your situations ...
> 
> Richard Walsh
> AHPCRC
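
The two workarounds described in this thread can be written down as
command sequences.  The following is only a rough sketch, not a tested
procedure: it builds the command lines and prints them rather than
running anything.  The job id "1234.server" and the helper name
recovery_commands are hypothetical, and the '-a' flag passed to
releasehold is an assumption, since the thread does not say which
releasehold option was used.

```python
import shlex


def recovery_commands(job_id, free_node=None, runjob_flag="-c", hold_flag="-a"):
    """Build (but do not run) the command lines discussed in the thread.

    job_id      -- id of the queued job stuck with a stale exec_host
    free_node   -- if given, use the 'qrun -H' workaround on that node
    runjob_flag -- '-c' or '-x'; the thread reports either seems to work
    hold_flag   -- ASSUMPTION: the thread names releasehold but no flag
    """
    if runjob_flag not in ("-c", "-x"):
        raise ValueError("the thread only mentions runjob -c or -x")

    if free_node is not None:
        # Workaround 1: start the job by hand on a node you picked yourself.
        return [["qrun", "-H", free_node, job_id]]

    # Workaround 2: release the hold in Maui, then ask Maui to run the job.
    return [
        ["releasehold", hold_flag, job_id],
        ["runjob", runjob_flag, job_id],
    ]


if __name__ == "__main__":
    # Hypothetical job id, printed as a dry run for review before use.
    for argv in recovery_commands("1234.server"):
        print(shlex.join(argv))
```

Printing the commands first keeps the sketch safe to try on a live
cluster; an administrator can review the output and run each line by
hand.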


