[torqueusers] Removing the "exec_host" attribute from a queued job?
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Tue Sep 20 02:02:10 MDT 2005
On Tue, 2005-09-20 at 09:23 +0200, Simon Robbins wrote:
> Hello,
>
> On Tue, 20 Sep 2005, Chris Samuel wrote:
>
> > Hi folks,
> >
> > I've got a job that's queued and obviously tried to start and failed and has
> > ended up with the following attribute set on it:
> >
> > exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> >
> > I suspect it's stopping Moab or Torque from running it again on other nodes,
> > and I'd like to clear that attribute, but it doesn't appear to be accessible
> > through qalter or qmgr.
> >
> > Any clues ?
>
> Unfortunately no. I have been seeing this behaviour
> for months now with torque_1.2.0p2,4,5 and 6. From
> Maui I get:
> HostList:
> [n504:1]
> Messages: cannot start job - RM failure, rc: 15041, msg:
> 'Execution server rejected request MSG=send failed, STARTING'
>
> Sometimes this is associated with a failure in the
> network.
>
> I always either wait until those nodes eventually
> become free (and the job runs) or ask the user to
> re-submit. However, sometimes when it attempts to
> start it a second time the same error occurs and I
> have to delete the job.
>
> I've tried things like `qalter -lneednodes= <jobid>`,
> with no effect.
>
> Does anyone else see this behaviour?

The failure is most probably caused by pbs_server timing out too quickly
while waiting for the MOM to reply. Are you using LDAP-based
passwd/group?

I have a patch to Maui (not Moab) that makes Maui ignore the job's
hostlist (exec_host). It's a flag that can be set on
QOS/ACCOUNT/USER/GROUP/CLASS.
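
For anyone who wants to check which nodes a stuck job is pinned to, here is a
minimal Python sketch that splits an exec_host value like the one quoted above
into its parts. It assumes each "+"-separated entry has the form node/index,
as in the example from this thread. The function names parse_exec_host and
slots_per_node are hypothetical helpers, not part of Torque, Maui or Moab.

```python
from collections import Counter


def parse_exec_host(value):
    """Split an exec_host string into (node, index) pairs.

    Assumes entries are separated by "+" and each entry looks like
    "node/index", as in the value quoted in this thread.
    """
    pairs = []
    for entry in value.split("+"):
        node, _, index = entry.strip().partition("/")
        pairs.append((node, int(index)))
    return pairs


def slots_per_node(value):
    """Count how many entries of the exec_host string name each node."""
    return Counter(node for node, _ in parse_exec_host(value))


# The exec_host value reported for the queued job in this thread.
exec_host = "edda010/0+edda007/3+edda007/2+edda007/1"
pairs = parse_exec_host(exec_host)
counts = slots_per_node(exec_host)

for node, n in counts.items():
    print(f"{node}: {n} slot(s)")
```

This only reads the attribute. It does not clear it, and it says nothing about
whether the scheduler will honour or ignore the list.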