[torqueusers] Removing the "exec_host" attribute from a queued job ?

Richard Walsh rbw at ahpcrc.org
Tue Sep 20 08:49:48 MDT 2005


Wolfgang Wander wrote:

>Simon Robbins writes:
> > 
> > Hello,
> > 
> > On Tue, 20 Sep 2005, Chris Samuel wrote:
> > 
> > > Hi folks,
> > > 
> > > I've got a job that's queued and obviously tried to start and failed and has 
> > > ended up with the following attribute set on it:
> > > 
> > >    exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> > > 
> > > I suspect it's stopping Moab or Torque from running it again on other nodes, 
> > > and I'd like to clear that attribute, but it doesn't appear to be accessible 
> > > through qalter or qmgr.
> > > 
> > > Any clues ?
> > 
> > Unfortunately no.  I have been seeing this behaviour 
> > for months now with torque_1.2.0p2,4,5 and 6.  From 
> > Maui I get:
> > HostList:
> >   [n504:1]
> > Messages:  cannot start job - RM failure, rc: 15041, msg: 
> > 'Execution server rejected request MSG=send failed, STARTING'
> > 
> > Sometimes this is associated with a failure in the 
> > network.
> > 
>
>I've noticed that you can start the job with qrun -H [free-node] jobid.
>You'll have to find a [free-node] manually, though, to make this
>work...
>
>           Wolfgang
>
All,

I had also sent a note about this to the Maui list.  Having read these
responses, I find that, in my case at least, a qrun by itself will not
get the job started.  You have to run releasehold (a Maui command) and
then runjob (also a Maui command; either -c or -x seems to work).
At that point the job seems to be allocated new processors and runs.

Perhaps this sequence will work in your situations ...
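
For anyone who wants to script it, here is a rough sketch of that
sequence.  It assumes the Maui client commands are on your PATH and that
you have scheduler admin rights.  The function names and the DRY_RUN
switch are made up for illustration, and releasehold is called without
options because I have not worked out which hold type matters here.

```shell
#!/bin/sh
# Sketch only: release the hold on a stuck job, then ask Maui to run it.
# restart_stuck_job, run and DRY_RUN are hypothetical helper names.

run() {
    # Echo the command; execute it unless DRY_RUN is set to 1.
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

restart_stuck_job() {
    jobid=${1:-}
    if [ -z "$jobid" ]; then
        echo "usage: restart_stuck_job JOBID" >&2
        return 2
    fi
    # Step 1: release the hold left on the job after the failed start.
    run releasehold "$jobid" || return 1
    # Step 2: start the job; -c worked here, -x seemed to work as well.
    run runjob -c "$jobid"
}
```

Calling it with DRY_RUN=1 set only prints the two commands, which is a
safe way to check the job id before touching the scheduler.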

Richard Walsh
AHPCRC

