[torqueusers] Re-executing a queued job

David Beer dbeer at adaptivecomputing.com
Mon Dec 30 07:50:28 MST 2013


If the job isn't deferred or held then it should be re-tried automatically.
If it becomes deferred or held then releasehold will force it to be retried.
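
A minimal sketch of that rule as a shell helper, in case it is useful for
scripted submissions. It reads the "State:" line that checkjob prints (as in
the output quoted below) and calls releasehold only when the job looks
deferred or held. The function names are hypothetical, and the state strings
"Deferred" and "...Hold" are an assumption about what checkjob reports on a
given installation, so check them against real output first.

```shell
#!/bin/sh
# Print the value of the "State:" line from checkjob output read on stdin.
job_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Succeed (exit 0) when the state suggests the scheduler has stopped retrying.
# Assumption: such states are reported as "Deferred" or contain "Hold".
needs_release() {
    case "$1" in
        Deferred|*Hold*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: release the job only if it is deferred or held;
# otherwise leave it alone, since the scheduler should retry it by itself.
retry_job() {
    state=$(checkjob "$1" | job_state)
    if needs_release "$state"; then
        releasehold "$1"
    else
        echo "job $1 is ${state:-unknown}; leaving it to the scheduler"
    fi
}
```

Usage would be something like `retry_job 118077.tiger` after sourcing the
file. An Idle job is deliberately left untouched.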

David


On Thu, Dec 26, 2013 at 7:32 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:

> The scheduler is Maui; however, the job is not deferred. Here is the
> complete log:
>
> [mahmood at tiger ~]$ showq
> .......
> 944 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>
>
> Total Jobs: 975   Active Jobs: 31   Idle Jobs: 944   Blocked Jobs: 0
>
>
>
>
>
> [mahmood at tiger ~]$ qstat 118077.tiger
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 118077.tiger              streaming        mahmood                0 Q tigerq
>
>
>
>
>
>
>
> [mahmood at tiger ~]$ checkjob 118077.tiger
>
>
> checking job 118077
>
> State: Idle
> Creds:  user:mahmood  group:mahmood  class:tigerq  qos:DEFAULT
> WallTime: 00:00:00 of 23:03:33:20
> SubmitTime: Thu Dec 26 10:15:11
>   (Time Queued  Total: 7:48:46  Eligible: 6:38:24)
>
> StartDate: -7:38:32  Thu Dec 26 10:25:25
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
>
> Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource
> temporarily unavailable MSG=job allocation request exceeds currently
> available cluster nodes, 1 requested, 0 available'
> PE:  1.00  StartPriority:  398
> job cannot run in partition DEFAULT (idle procs do not meet requirements :
> 0 of 1 procs found)
> idle procs:  29  feasible procs:   0
>
> Rejection Reasons: [State        :    1]
>
>
>
>
>
>
> So, do I have to run "releasehold <jobid>"?
>
>
> Regards,
> Mahmood
>
>
> On Thursday, December 26, 2013 5:57 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
>  If you are using Moab or Maui then they will 'defer' jobs that aren't
> able to run after a few retries. You probably need to do something like
>
> releasehold <jobid>
>
> to let the scheduler know it's okay to retry job execution. There is
> also a parameter that controls how long jobs stay deferred before they
> are retried - DEFERTIME. It defaults to 1 hour.
>
>
> On Thu, Dec 26, 2013 at 7:18 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>
> Hi,
> I have submitted some jobs, but ever since I submitted them they have been
> (and still are) in the Q state with this message:
>
> Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource
> temporarily unavailable MSG=job allocation request exceeds currently
> available cluster nodes, 1 requested, 0 available'
>
> How can I re-execute the job? Maybe the resource was not available at that
> time. I cannot delete the jobs and resubmit them because they were
> generated by a script.
>
> Is there any way to *retry* the queued job?
>
> Regards,
> Mahmood
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing