[torqueusers] Re: Job stuck with exec_host set but queued

Lawrence Lowe L.S.Lowe at bham.ac.uk
Fri Mar 6 08:04:16 MST 2009


Hi all, I seem to have run into a problem similar to one reported some time 
ago: some jobs have got into a state where they have exec_host set but are 
still queued and won't start. Torque is torque-2.3.0-snap.200801151629 and 
maui is maui-3.2.6p20-snap.1182974819, both standard for the collaboration 
I'm in.

The jobs probably got into that state because of a temporary problem on 
the worker node they were sent to; tracejob says:

03/04/2009 16:46:09  S    Job Run at request of root at epgce3.my.domain
03/04/2009 16:46:11  S    MOM rejected modify request, error: 15001
03/04/2009 16:46:30  S    send of job to epgd16.my.domain failed error = 15020
03/04/2009 16:46:38  S    send of job to epgd16.my.domain failed error = 15020
03/04/2009 16:46:38  S    unable to run job, MOM rejected/timeout

The jobs had then acquired an exec_host (epgd16) pointing at the failed 
target worker node, and since then they have not been schedulable. The 
important bit of the maui.log seems to be "Cannot execute at specified 
host because of checkpoint or stagein files MSG=allocated nodes must match 
input file stagein location. hostlist: epgd01.my.domain". checkjob for 
those jobs gives the same sort of message.
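For anyone wanting to spot jobs in this state, the symptom is a job_state 
of Q together with a non-empty exec_host in the `qstat -f` output. A 
minimal sketch of checking for that follows; the job id and output are 
made up, and the parsing is demonstrated on a canned sample rather than a 
live server:

```shell
# Hypothetical sample of the relevant fields from `qstat -f <jobid>`;
# against a real server this would be: qstat -f 100180 | grep -E 'job_state|exec_host'
qstat_output='Job Id: 100180.epgce3.my.domain
    job_state = Q
    exec_host = epgd16.my.domain/0'

# Flag the job if it is queued (Q) but still carries an exec_host.
echo "$qstat_output" | awk -F' = ' '
    /job_state/ { state = $2 }
    /exec_host/ { host  = $2 }
    END { if (state == "Q" && host != "") print "stuck on: " host }'
# prints: stuck on: epgd16.my.domain/0
```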

Trying to run these by hand with qrun, either with no specific host or 
with the host named in the job's exec_host, fails with the same message 
that maui got.

This happened to about 5 jobs at more or less the same time, all with the 
same exec_host, and all with stagein requirements. A bad side effect was 
that maui then stopped scheduling any new jobs. I've since tweaked maui 
with a non-zero DEFERTIME so that it puts those jobs into deferred status 
for a while and carries on with other jobs.
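For completeness, the tweak I mean is just the standard DEFERTIME 
parameter in maui.cfg (the values below are illustrative, not a 
recommendation):

```
# maui.cfg (illustrative values)
DEFERTIME   00:30:00    # how long a job that failed to start stays deferred
DEFERCOUNT  24          # after this many deferrals the job gets a batch hold
```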

Is there any way out of this problem for jobs like these? Is it fixed in 
later torque versions?

Thanks for any help, Lawrence Lowe

bits of maui.log today:

03/06 10:25:49 INFO:     tasks located for job 100180:  1 of 1 required (116 feasible)
03/06 10:25:49 MJobStart(100180)
03/06 10:25:49 MJobDistributeTasks(100180,0,NodeList,TaskMap)
03/06 10:25:49 MAMAllocJReserve(100180,RIndex,ErrMsg)
03/06 10:25:49 MRMJobStart(100180,Msg,SC)
03/06 10:25:49 MPBSJobStart(100180,0,Msg,SC)
03/06 10:25:49 MPBSJobModify(100180,Resource_List,Resource,epgd01.my.domain)
03/06 10:25:49 ERROR:    job '100180' cannot be started: (rc: 15057 
errmsg: 'Cannot execute at specified host because of checkpoint
  or stagein files MSG=allocated nodes must match input file stagein 
location'  hostlist: 'epgd01.my.domain')
03/06 10:25:49 MPBSJobModify(100180,Resource_List,Resource,1)
03/06 10:25:49 ALERT:    cannot start job 100180 (RM '0' failed in function 'jobstart')
03/06 10:25:49 WARNING:  cannot start job '100180' through resource manager
03/06 10:25:49 ALERT:    job '100180' deferred after 15406 failed start attempts (API failure on last attempt)
03/06 10:25:49 MJobSetHold(100180,16,00:00:00,RMFailure,cannot start job - 
RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match 
input file stagein location')
03/06 10:25:49 INFO:     defer disabled

Tel: 0121 414 4621    Fax: 0121 414 6709    Email: L.S.Lowe at bham.ac.uk

On Wed, 14 Sep 2005, Chris Samuel wrote:

> Hi folks,
>
> I've got a job that's queued and somehow managed to get itself into the state
> where exec_host is set to a list of nodes even though it's still waiting to
> run.  It appears that it has attempted to start and failed and now has this
> vestige left and I can't figure out how to remove it!
>
> Checkjob says:
>
> RM failure, rc: 15057, msg: 'Cannot execute at specified host because of
> checkpoint or stagein files MSG=cannot assign hosts'
>
> Not sure quite what that's trying to tell me, I suspect the last part of the
> message is more accurate than the former as neither of the two statements are
> correct.
>
> Any clues ?
>
> Chris

