[torqueusers] Removing the "exec_host" attribute from a queued job ?

Simon Robbins robbins at physik.uni-wuppertal.de
Tue Sep 20 01:23:19 MDT 2005


Hello,

On Tue, 20 Sep 2005, Chris Samuel wrote:

> Hi folks,
> 
> I've got a job that's queued and obviously tried to start and failed and has 
> ended up with the following attribute set on it:
> 
>    exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> 
> I suspect it's stopping Moab or Torque from running it again on other nodes, 
> and I'd like to clear that attribute, but it doesn't appear to be accessible 
> through qalter or qmgr.
> 
> Any clues ?

Unfortunately no.  I have been seeing this behaviour
for months now with Torque 1.2.0p2, p4, p5 and p6.
From Maui I get:

HostList:
  [n504:1]
Messages:  cannot start job - RM failure, rc: 15041, msg:
'Execution server rejected request MSG=send failed, STARTING'

Sometimes this coincides with a network failure.

I always either wait until those nodes eventually
become free (at which point the job runs) or ask the
user to re-submit.  However, sometimes the same error
occurs when the job is started a second time, and I
then have to delete it.

I've tried things like `qalter -lneednodes= <jobid>`,
with no effect.
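
To see at a glance which nodes a stuck job is still pinned to,
the exec_host value can be split into hosts and slot indices.
This is only a minimal Python sketch, assuming the value has the
host/slot+host/slot form shown in Chris's message;
parse_exec_host is a hypothetical helper, not part of Torque or
Maui, and it only reads the value, it does not clear it.

```python
from collections import defaultdict


def parse_exec_host(value):
    """Split an exec_host string into {hostname: [slot indices]}.

    Assumes entries are joined by '+' and each entry is 'host/slot',
    as in the exec_host line quoted in this thread.
    """
    slots = defaultdict(list)
    for entry in value.strip().split("+"):
        host, _, slot = entry.partition("/")
        slots[host].append(int(slot))
    return dict(slots)


# Value copied from the exec_host attribute quoted above.
nodes = parse_exec_host("edda010/0+edda007/3+edda007/2+edda007/1")
print(nodes)  # {'edda010': [0], 'edda007': [3, 2, 1]}
```

The resulting host names are the ones to watch when waiting for
the nodes to become free.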

Does anyone else see this behaviour?

Simon.

