[torqueusers] Removing the "exec_host" attribute from a queued job ?

Chris Samuel csamuel at vpac.org
Tue Sep 20 17:53:31 MDT 2005


On Tue, 20 Sep 2005 01:18 pm, Chris Samuel wrote:

>    exec_host = edda010/0+edda007/3+edda007/2+edda007/1
>
> I suspect it's stopping Moab or Torque from running it again on other
> nodes, and I'd like to clear that attribute, but it doesn't appear to be
> accessible through qalter or qmgr.

OK - I think I may have narrowed down what's going on. It looks like it
could be down to the job's checkpoint attribute, which for some reason is
set to 'u' on all jobs by default. The qsub man page doesn't document the
'u' value, so I've set it to 'n' instead (which it says means no
checkpointing) and released the scheduler hold on the job.

I'm pretty sure Moab is not the issue: it's happily picking whatever nodes are
free to try and run the job, but Torque (1.2.0p5) is rejecting it with the
error below:

09/21 09:28:49 MJobStart(46533,EMsg)
09/21 09:28:49 MRMJobStart(46533,EMsg,SC)
09/21 09:28:49 MPBSJobStart(46533,base,EMsg,SC)
09/21 09:28:49 MPBSJobModify(46533,Resource_List,neednodes,edda018:ppn=2+edda014:ppn=2)
09/21 09:28:49 ERROR:    job '46533' cannot be started: (rc: 15057  errmsg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=cannot assign hosts'  hostlist: 'edda018:ppn=2+edda014:ppn=2')
09/21 09:28:49 MPBSJobModify(46533,Resource_List,neednodes,)
09/21 09:28:49 ALERT:    job '46533' deferred after 78 failed start attempts (API failure on last attempt)
09/21 09:28:49 ALERT:    job '46533' cannot run (deferring job for 60 seconds)
09/21 09:28:49 MRMJobModify(46533,comment,SC)
09/21 09:28:49 INFO:     cannot annotate job '46533' with message 'cannot start job 46533 - RM failure, rc: 15057, msg: 'Cannot execute at specified host because of checkpoint or stagein files MSG=cannot assign hosts''
09/21 09:28:49 INFO:     batch hold placed on job '46533', reason: 'RMFailure'
09/21 09:28:49 MSysRegEvent(JOBHOLD:  batch hold placed on job '46533'.  defercount: 26  reason: 'RMFailure',0,0,1)
09/21 09:28:49 MSysLaunchAction(ASList,)
09/21 09:28:49 ALERT:    cannot run reserved job '46533'
09/21 09:28:49 INFO:     0 reserved jobs started this iteration
09/21 09:28:49 INFO:     total jobs selected in partition ALL: 55/56 [EState: 1]
09/21 09:28:49 INFO:     total jobs selected in partition base: 55/55
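
For anyone who wants to pull these start failures out of a longer Moab log,
here is a minimal Python sketch. It assumes the "cannot be started" line
always has the rc / errmsg / hostlist layout seen in the excerpt above, which
I have only inferred from this one sample. The function name
find_start_failures and the regular expression are purely illustrative; they
are not part of Moab or Torque.

```python
import re

# Layout of the "cannot be started" line, inferred from the single log
# excerpt in this message (an assumption, not a documented Moab format).
START_FAILURE = re.compile(
    r"job '(?P<job>\d+)' cannot be started: "
    r"\(rc: (?P<rc>\d+)\s+errmsg: '(?P<errmsg>.*?)'\s+"
    r"hostlist\s*: '(?P<hosts>[^']*)'\)"
)

# Every real log record starts with an "MM/DD HH:MM:SS " timestamp.
TIMESTAMP = re.compile(r"\d\d/\d\d \d\d:\d\d:\d\d ")


def find_start_failures(log_text):
    """Return one dict per failed job start found in log_text."""
    # Rejoin hard-wrapped records: a non-empty line without a leading
    # timestamp is treated as the continuation of the previous line.
    records = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += line

    failures = []
    for record in records:
        match = START_FAILURE.search(record)
        if match:
            failures.append({
                "job": match.group("job"),
                "rc": int(match.group("rc")),
                "errmsg": match.group("errmsg"),
                # Moab separates the requested hosts with '+'.
                "hosts": match.group("hosts").split("+"),
            })
    return failures
```

Calling find_start_failures(open("moab.log").read()) (the file name is just a
placeholder) returns a list of dicts, so repeated failures with the same rc
for a job are easy to count or group by host.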

Here's hoping..

Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
