[torqueusers] Re-executing a queued job

Mahmood Naderan nt_mahmood at yahoo.com
Thu Dec 26 07:32:42 MST 2013


The scheduler is Maui; however, the job is not deferred. Here is the complete log:

[mahmood at tiger ~]$ showq
.......
944 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 975   Active Jobs: 31   Idle Jobs: 944   Blocked Jobs: 0





[mahmood at tiger ~]$ qstat 118077.tiger
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
118077.tiger               streaming        mahmood                0 Q tigerq







[mahmood at tiger ~]$ checkjob 118077.tiger


checking job 118077

State: Idle
Creds:  user:mahmood  group:mahmood  class:tigerq  qos:DEFAULT
WallTime: 00:00:00 of 23:03:33:20
SubmitTime: Thu Dec 26 10:15:11
  (Time Queued  Total: 7:48:46  Eligible: 6:38:24)

StartDate: -7:38:32  Thu Dec 26 10:25:25
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available'
PE:  1.00  StartPriority:  398
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs:  29  feasible procs:   0

Rejection Reasons: [State        :    1]






So, do I have to run "releasehold <jobid>"?

 
Regards,
Mahmood



On Thursday, December 26, 2013 5:57 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
 
If you are using Moab or Maui then they will 'defer' jobs that aren't able to run after a few retries. You probably need to do something like 

releasehold <jobid>

to let the scheduler know it's okay to retry job execution. There is also a parameter, DEFERTIME, that controls how long jobs stay deferred before they are retried. It defaults to 1 hour.
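
When many jobs were generated by a script, the release can be looped instead of typed per job. What follows is only a minimal sketch, assuming the default six-column qstat layout shown in this thread (job state in column 5) and that Maui's releasehold accepts the job ID as qstat prints it. The function names queued_job_ids and release_queued and the RUN variable are made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Print the IDs of jobs in state Q, reading default `qstat` output on stdin.
# Assumption: state is the 5th whitespace-separated column, as in this thread.
queued_job_ids() {
    awk 'NR > 2 && $5 == "Q" { print $1 }'
}

# Dry run by default: print one releasehold command per queued job.
# Set RUN=1 to actually invoke releasehold (hypothetical switch for this sketch).
release_queued() {
    queued_job_ids | while read -r id; do
        if [ "${RUN:-0}" = "1" ]; then
            releasehold -a "$id"
        else
            echo "releasehold -a $id"
        fi
    done
}
```

Typical use would be `qstat | release_queued` to review the list, then `qstat | RUN=1 release_queued` once it looks right. If releasehold rejects the ".tiger" suffix on your installation, strip it before passing the ID.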




On Thu, Dec 26, 2013 at 7:18 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:

Hi,
>I have submitted some jobs; however, since I submitted them they have been (and still are) in the Q state, with this reason:
>
>
>Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available'
>
>
>
>How can I re-execute the job? Maybe the resource was not available at that time. I cannot delete the jobs and resubmit them because they were generated by a script.
>
>
>Any way to *retry* the queued job?
> 
>Regards,
>Mahmood


-- 

David Beer | Senior Software Engineer
Adaptive Computing