[torqueusers] Removing the "exec_host" attribute from a queued
rbw at ahpcrc.org
Tue Sep 20 08:49:48 MDT 2005
Wolfgang Wander wrote:
>Simon Robbins writes:
> > Hello,
> > On Tue, 20 Sep 2005, Chris Samuel wrote:
> > > Hi folks,
> > >
> > > I've got a job that's queued and obviously tried to start and failed and has
> > > ended up with the following attribute set on it:
> > >
> > > exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> > >
> > > I suspect it's stopping Moab or Torque from running it again on other nodes,
> > > and I'd like to clear that attribute, but it doesn't appear to be accessible
> > > through qalter or qmgr.
> > >
> > > Any clues ?
> > Unfortunately no. I have been seeing this behaviour
> > for months now with torque_1.2.0p2,4,5 and 6. From
> > Maui I get:
> > HostList:
> > [n504:1]
> > Messages: cannot start job - RM failure, rc: 15041, msg:
> > 'Execution server rejected request MSG=send failed, STARTING'
> > Sometimes this is associated with a failure in the
> > network.
>I've noticed that you can qrun -H [free-node] jobid the job.
>You'll have to find a [free-node] manually though to make this
I had also sent a note about this to the Maui list. Having read these
responses, I find that, in my case at least, a qrun by itself will not
get the job started. You have to run a releasehold (a Maui command) and
then a runjob (also a Maui command; either -c or -x seems to work).
At that point the job seems to be allocated new processors and runs.
Perhaps this sequence will work in your situations ...
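
For anyone who wants to script it, here is a rough sketch of that
sequence. It assumes releasehold and runjob are on your PATH and that
releasehold accepts a bare job ID; check whether your Maui version wants
a hold-type option as well. The function names and the dry-run switch
are my own illustration, not part of Maui or Torque.

```python
import subprocess


def recovery_commands(job_id, runjob_flag="-c"):
    """Build the two Maui commands: releasehold first, then runjob."""
    # Only the two runjob options mentioned above are accepted here.
    if runjob_flag not in ("-c", "-x"):
        raise ValueError("expected -c or -x for runjob")
    return [
        ["releasehold", job_id],
        ["runjob", runjob_flag, job_id],
    ]


def restart_job(job_id, runjob_flag="-c", dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    commands = recovery_commands(job_id, runjob_flag)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence if releasehold fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling restart_job("1234") only prints the two commands; pass
dry_run=False to actually run them against the scheduler.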