[torqueusers] Removing the "exec_host" attribute from a queued job?
Simon Robbins
robbins at physik.uni-wuppertal.de
Tue Sep 20 01:23:19 MDT 2005
Hello,
On Tue, 20 Sep 2005, Chris Samuel wrote:
> Hi folks,
>
> I've got a job that's queued and obviously tried to start and failed and has
> ended up with the following attribute set on it:
>
> exec_host = edda010/0+edda007/3+edda007/2+edda007/1
>
> I suspect it's stopping Moab or Torque from running it again on other nodes,
> and I'd like to clear that attribute, but it doesn't appear to be accessible
> through qalter or qmgr.
>
> Any clues ?
Unfortunately no. I have been seeing this behaviour
for months now with torque_1.2.0p2, p4, p5 and p6. From
Maui I get:
HostList:
[n504:1]
Messages: cannot start job - RM failure, rc: 15041, msg:
'Execution server rejected request MSG=send failed, STARTING'
Sometimes this is associated with a failure in the
network.
I always either wait until those nodes eventually
become free (and the job runs) or ask the user to
re-submit. However, sometimes the same error occurs
when the job is started a second time, and I then
have to delete the job.
I've tried things like `qalter -lneednodes= <jobid>`,
with no effect.
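For anyone who wants to check whether they have jobs in the same state, here is a rough sketch that scans `qstat -f` output for jobs that are still queued (job_state = Q) but already carry an exec_host. It assumes the usual `qstat -f` layout: a "Job Id:" line per job, indented "name = value" attribute lines, and tab-indented continuation lines. The function names and the sample job ids are made up for illustration. It only reports such jobs; it does not clear the attribute.

```python
def parse_jobs(text):
    """Parse `qstat -f` style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None   # attribute dict of the job being read
    last_key = None  # last attribute seen, for wrapped values
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is None or not line.strip():
            continue
        elif line.startswith("\t") and last_key is not None:
            # Assumption: long values are wrapped onto tab-indented lines.
            current[last_key] += line.strip()
        elif " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def queued_with_exec_host(text):
    """Return {job_id: exec_host} for queued jobs that have exec_host set."""
    return {
        job_id: attrs["exec_host"]
        for job_id, attrs in parse_jobs(text).items()
        if attrs.get("job_state") == "Q" and "exec_host" in attrs
    }


def split_exec_host(exec_host):
    """Split 'node/cpu+node/cpu' into a list of (node, cpu) pairs."""
    return [tuple(slot.split("/", 1)) for slot in exec_host.split("+")]


# Hypothetical sample in the assumed `qstat -f` layout.
SAMPLE = """Job Id: 101.example-server
    Job_Name = stuck
    job_state = Q
    exec_host = edda010/0+edda007/3+edda007/2+edda007/1

Job Id: 102.example-server
    Job_Name = waiting
    job_state = Q
"""

if __name__ == "__main__":
    for job_id, hosts in queued_with_exec_host(SAMPLE).items():
        print(job_id, split_exec_host(hosts))
```

On a real server the text would come from the command itself, for example `subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout`, instead of SAMPLE.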
Does anyone else see this behaviour?
Simon.