[torqueusers] Removing the "exec_host" attribute from a queued job?

Åke Sandgren ake.sandgren at hpc2n.umu.se
Tue Sep 20 02:02:10 MDT 2005


On Tue, 2005-09-20 at 09:23 +0200, Simon Robbins wrote:
> Hello,
> 
> On Tue, 20 Sep 2005, Chris Samuel wrote:
> 
> > Hi folks,
> > 
> > I've got a queued job that evidently tried to start, failed, and has 
> > ended up with the following attribute set on it:
> > 
> >    exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> > 
> > I suspect it's stopping Moab or Torque from running it again on other nodes, 
> > and I'd like to clear that attribute, but it doesn't appear to be accessible 
> > through qalter or qmgr.
> > 
> > Any clues?
> 
> Unfortunately no.  I have been seeing this behaviour 
> for months now with torque_1.2.0p2, p4, p5 and p6.  From 
> Maui I get:
> HostList:
>   [n504:1]
> Messages:  cannot start job - RM failure, rc: 15041, msg: 
> 'Execution server rejected request MSG=send failed, STARTING'
> 
> Sometimes this is associated with a failure in the 
> network.
> 
> I always either wait until those nodes eventually 
> become free (and the job runs) or ask the user to 
> re-submit.  However, sometimes when it attempts to 
> start it a second time the same error occurs and I 
> have to delete the job.
> 
> I've tried things like `qalter -lneednodes=  <jobid>`,
> with no effect.
> 
> Does anyone else see this behaviour?

The failure is most probably caused by pbs_server timing out too quickly
while waiting for the mom to reply. Are you using LDAP-based
passwd/group?

I have a patch to Maui (not Moab) that makes Maui ignore the job's
hostlist (exec_host). It's a flag that can be set on
QOS/ACCOUNT/USER/GROUP/CLASS.
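
When checking which nodes a stuck job is still pinned to, it can help to
break the exec_host string apart. The following is only a minimal Python
sketch, assuming the value is a "+"-separated list of host/slot pairs as
in the example quoted above; the function name parse_exec_host is made up
for illustration and is not part of Torque or Maui.

```python
from collections import defaultdict


def parse_exec_host(value):
    """Group slot indices by host from an exec_host string.

    Assumes the "host/slot+host/slot+..." layout seen in the quoted
    example; other layouts are not handled.
    """
    slots = defaultdict(list)
    for entry in value.strip().split("+"):
        # Split "edda007/3" into host "edda007" and slot "3".
        host, _, slot = entry.partition("/")
        slots[host].append(int(slot))
    return dict(slots)


if __name__ == "__main__":
    print(parse_exec_host("edda010/0+edda007/3+edda007/2+edda007/1"))
```

For the value quoted above this groups slot 0 under edda010 and slots 3,
2 and 1 under edda007, which makes it easier to see which nodes the job
is waiting on.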

