[torqueusers] Removing the "exec_host" attribute from a queued job?
garrick at usc.edu
Tue Sep 20 18:22:23 MDT 2005
On Wed, Sep 21, 2005 at 09:53:31AM +1000, Chris Samuel alleged:
> On Tue, 20 Sep 2005 01:18 pm, Chris Samuel wrote:
> >     exec_host = edda010/0+edda007/3+edda007/2+edda007/1
> > I suspect it's stopping Moab or Torque from running it again on other
> > nodes, and I'd like to clear that attribute, but it doesn't appear to be
> > accessible through qalter or qmgr.
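
For comparing that stale value against the nodes a scheduler later picks, the exec_host string can be split into nodes and slot indexes. This is only a sketch: it assumes the value is always `node/index` entries joined by `+`, as in the one sample quoted above, and `parse_exec_host` is a hypothetical helper name, not part of Torque.

```python
from collections import defaultdict


def parse_exec_host(value):
    """Split 'nodeA/0+nodeB/1' into {node: [slot index, ...]}.

    Assumes every entry has the form node/index, as in the quoted sample.
    """
    slots = defaultdict(list)
    for entry in value.split("+"):
        node, _, index = entry.partition("/")
        slots[node].append(int(index))
    return dict(slots)


# Value copied from the job attribute quoted in this thread.
stale = parse_exec_host("edda010/0+edda007/3+edda007/2+edda007/1")
print(stale)  # {'edda010': [0], 'edda007': [3, 2, 1]}
```

The node names on their own are available as `set(stale)`, which is enough to see whether a new host list overlaps the old one.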
> OK - I think I may have narrowed down what's going on, and it looks like it
> could be down to a checkpoint attribute which is set to 'u' on all jobs by
> default for some reason. The qsub man page doesn't document the 'u' value,
> so I've set it to 'n' instead (which it says means no checkpointing) and
> released the scheduler hold on the job.
> I'm pretty sure Moab is not the issue; it's happily picking whatever nodes are
> free to try to run the job, but Torque (1.2.0p5) is rejecting it because:
> 09/21 09:28:49 MJobStart(46533,EMsg)
> 09/21 09:28:49 MRMJobStart(46533,EMsg,SC)
> 09/21 09:28:49 MPBSJobStart(46533,base,EMsg,SC)
> 09/21 09:28:49 MPBSJobModify(46533,Resource_List,neednodes,edda018:ppn=2+edda014:ppn=2)
> 09/21 09:28:49 ERROR: job '46533' cannot be started: (rc: 15057 errmsg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=cannot assign hosts' hostlist: 'edda018:ppn=2+edda014:ppn=2')
> 09/21 09:28:49 MPBSJobModify(46533,Resource_List,neednodes,)
> 09/21 09:28:49 ALERT: job '46533' deferred after 78 failed start attempts (API failure on last attempt)
> 09/21 09:28:49 ALERT: job '46533' cannot run (deferring job for 60 seconds)
> 09/21 09:28:49 MRMJobModify(46533,comment,SC)
> 09/21 09:28:49 INFO: cannot annotate job '46533' with message 'cannot start job 46533 - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=cannot assign hosts''
> 09/21 09:28:49 INFO: batch hold placed on job '46533', reason: 'RMFailure'
> 09/21 09:28:49 MSysRegEvent(JOBHOLD: batch hold placed on job '46533'. defercount: 26 reason: 'RMFailure',0,0,1)
> 09/21 09:28:49 MSysLaunchAction(ASList,)
> 09/21 09:28:49 ALERT: cannot run reserved job '46533'
> 09/21 09:28:49 INFO: 0 reserved jobs started this iteration
> 09/21 09:28:49 INFO: total jobs selected in partition ALL: 55/56 [EState: 1]
> 09/21 09:28:49 INFO: total jobs selected in partition base: 55/55
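
When the same failure repeats dozens of times, it can help to pull the return code, message, and rejected host list out of the log mechanically. The sketch below is an assumption built from the single ERROR line quoted above, with the wrapped line rejoined; the pattern is not an official Moab log grammar, and `parse_start_failure` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one ERROR line in this thread (an assumption,
# not a documented log format). \s* tolerates the wrap before the colon.
ERROR_RE = re.compile(
    r"ERROR:\s+job '(?P<job>\d+)' cannot be started: "
    r"\(rc: (?P<rc>\d+) errmsg: '(?P<msg>.*?)' hostlist\s*: '(?P<hosts>[^']*)'\)"
)


def parse_start_failure(line):
    """Return job id, rc, message and node names, or None if no match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    nodes = [spec.split(":")[0] for spec in match["hosts"].split("+")]
    return {
        "job": match["job"],
        "rc": int(match["rc"]),
        "msg": match["msg"],
        "nodes": nodes,
    }


line = (
    "09/21 09:28:49 ERROR: job '46533' cannot be started: (rc: 15057 "
    "errmsg: 'Cannot execute at specified host because of checkpoint or "
    "stagein files MSG=cannot assign hosts' hostlist: "
    "'edda018:ppn=2+edda014:ppn=2')"
)
failure = parse_start_failure(line)
print(failure["rc"], failure["nodes"])  # 15057 ['edda018', 'edda014']
```

Here the rejected nodes (edda018, edda014) share nothing with the nodes recorded in the job's exec_host (edda010, edda007).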
It sounds like the job, when it was first started, got as far as copying
its input files before failing. It's worth finding out whether the
original mother superior (MS) node received a job start commit.
Garrick Staples, Linux/HPCC Administrator
University of Southern California