[torqueusers] Re: Job stuck with exec_host set but queued
L.S.Lowe at bham.ac.uk
Fri Mar 6 08:04:16 MST 2009
Hi all, I seem to have got a similar problem to one reported some time
ago: some jobs have gotten into a state where they have exec_host set but
are queued and won't start. Torque is torque-2.3.0-snap.200801151629, maui
is maui-3.2.6p20-snap.1182974819, both standard for the collab I'm in.
The jobs probably got into that state originally because of a temporary
problem on the worker node they were sent to; tracejob says:
03/04/2009 16:46:09 S Job Run at request of root at epgce3.my.domain
03/04/2009 16:46:11 S MOM rejected modify request, error: 15001
03/04/2009 16:46:30 S send of job to epgd16.my.domain failed error = 15020
03/04/2009 16:46:38 S send of job to epgd16.my.domain failed error = 15020
03/04/2009 16:46:38 S unable to run job, MOM rejected/timeout
The jobs had then acquired an exec_host (epgd16) pointing at the failed
target worker node, and since then they have not been schedulable. The
important bit of the maui.log seems to be "Cannot execute at specified
host because of checkpoint or stagein files MSG=allocated nodes must
match input file stagein location. hostlist: epgd01.my.domain". checkjob
for those jobs gives the same sort of message.
Trying to run these by hand with qrun, either with no specific host or
with the host named in the job's exec_host, fails with the same message
that maui got.
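For anyone seeing the same thing, the inspection/retry sequence I went
through was roughly the following (job id 100180 is just the example
from the log further down; these are all standard Torque/Maui commands,
though your paths may differ):

```shell
# Show the full job record; a queued job should normally have no exec_host
qstat -f 100180 | grep -E 'job_state|exec_host|stagein'

# Server-side history of the failed start attempts
tracejob 100180

# Maui's view of why it won't schedule the job
checkjob 100180

# Try to force-start it, first letting the server pick a node,
# then pinning it to the node already recorded in exec_host
qrun 100180
qrun -H epgd16.my.domain 100180
```

In my case both qrun forms fail with the same rc 15057 message that maui
reports.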
This happened to about 5 jobs at more or less the same time, all with
the same exec_host. All these jobs do have stagein requirements. A bad
side effect was that maui then stopped scheduling any new jobs. I've
since tweaked maui with a non-zero DEFERTIME so that it puts those jobs
into deferred state for a while and carries on with other jobs.
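For reference, the tweak is a one-line change in maui.cfg; the value
below is just the one I picked, not a recommendation:

```
# maui.cfg: defer jobs that repeatedly fail to start, instead of
# blocking the scheduling cycle on them (duration is my own choice)
DEFERTIME      1:00:00
```

After restarting maui, the stuck jobs go into a deferred/hold state for
that interval and other jobs schedule normally.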
Is there any way out of this problem for jobs like these? Is it fixed in
later torque versions?
Thanks for any help, Lawrence Lowe
bits of maui.log today:
03/06 10:25:49 INFO: tasks located for job 100180: 1 of 1 required (116 feasible)
03/06 10:25:49 MJobStart(100180)
03/06 10:25:49 MJobDistributeTasks(100180,0,NodeList,TaskMap)
03/06 10:25:49 MAMAllocJReserve(100180,RIndex,ErrMsg)
03/06 10:25:49 MRMJobStart(100180,Msg,SC)
03/06 10:25:49 MPBSJobStart(100180,0,Msg,SC)
03/06 10:25:49 MPBSJobModify(100180,Resource_List,Resource,epgd01.my.domain)
03/06 10:25:49 ERROR: job '100180' cannot be started: (rc: 15057
errmsg: 'Cannot execute at specified host because of checkpoint
or stagein files MSG=allocated nodes must match input file stagein
location' hostlist: 'epgd01.my.domain')
03/06 10:25:49 MPBSJobModify(100180,Resource_List,Resource,1)
03/06 10:25:49 ALERT: cannot start job 100180 (RM '0' failed in function 'jobstart')
03/06 10:25:49 WARNING: cannot start job '100180' through resource manager
03/06 10:25:49 ALERT: job '100180' deferred after 15406 failed start attempts (API failure on last attempt)
03/06 10:25:49 MJobSetHold(100180,16,00:00:00,RMFailure,cannot start job -
RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match
input file stagein location')
03/06 10:25:49 INFO: defer disabled
Tel: 0121 414 4621 Fax: 0121 414 6709 Email: L.S.Lowe at bham.ac.uk
On Wed, 14 Sep 2005, Chris Samuel wrote:
> Hi folks,
> I've got a job that's queued and somehow managed to get itself into the state
> where exec_host is set to a list of nodes even though it's still waiting to
> run. It appears that it has attempted to start and failed and now has this
> vestige left and I can't figure out how to remove it!
> Checkjob says:
> RM failure, rc: 15057, msg: 'Cannot execute at specified host because of
> checkpoint or stagein files MSG=cannot assign hosts'
> Not sure quite what that's trying to tell me; I suspect the last part of the
> message is more accurate than the former, as neither of the two statements are
> Any clues ?