[torqueusers] Slot limit issues (still)

Ken Nielson knielson at adaptivecomputing.com
Tue Sep 24 10:49:20 MDT 2013


On Tue, Sep 24, 2013 at 10:44 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:

> Ken,
>
> Yes, they are in a blocked state (batch hold) per section 10-e of the mwm
> documentation:
>
> In most cases, a job violating these policies is not placed into a batch
> hold immediately; rather, it is deferred. The parameter DEFERTIME
> <http://docs.adaptivecomputing.com/mwm/Content/a.fparameters.html#defertime>
> indicates how long it is deferred. At this time, it is allowed back into
> the idle queue and again considered for scheduling. If it again is unable
> to run at that time or at any time in the future, it is again deferred for
> the timeframe specified by DEFERTIME. A job is released and deferred up to
> DEFERCOUNT
> <http://docs.adaptivecomputing.com/mwm/Content/a.fparameters.html#defercount>
> times at which point the scheduler places a batch hold on the job and
> waits for a system administrator to determine the correct course of
> action. Deferred jobs have a Moab state of Deferred. As with jobs in the
> BatchHold state, the reason the job was deferred can be determined by use
> of the checkjob command.
>
> At any time, a job can be released from any hold or deferred state using
> the releasehold
> <http://docs.adaptivecomputing.com/mwm/Content/commands/releasehold.html>
> command. The Moab logs should provide detailed information about the cause
> of any batch hold or job deferral.
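>
> (For reference, DEFERTIME and DEFERCOUNT are plain moab.cfg parameters; a
> minimal sketch with illustrative values only, not our actual settings:)
>
>     # moab.cfg -- illustrative values
>     DEFERTIME   1:00:00   # how long a deferred job waits before re-evaluation
>     DEFERCOUNT  24        # defers allowed before the job lands in BatchHold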
>
> Oddly, in my case, I can get no info from the checkjob command:
>
> State: Hold
> .
> .
> NOTE:  job cannot run  (job has hold in place)
> BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
>
> And doing releasehold doesn’t help:
>
> [root at hamming ~]# releasehold -a  20139590[1561]
> holds not modified for job 20139590[1561]  ( hold still in place)
>
> So it seems, somehow, somewhere, TORQUE thinks this user/job has 512 slots
> already taken…
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
> Sent: Tuesday, September 24, 2013 9:14 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Slot limit issues (still)
>
> On Tue, Sep 24, 2013 at 9:30 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
>
> OK, this one is still going on with the same array job.
>
> I have many array jobs (same parent job) that have gone into a 'blocked'
> status because they couldn't start in a timely manner
> (DEFERTIME/DEFERCOUNT). Not unusual for a sizeable array job with slot
> limits (set server max_slot_limit = 512).
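>
> (For context, a minimal sketch of the two places an array slot limit can
> come from; the ranges and script name below are only illustrative, and the
> %N form assumes a TORQUE version that supports it:)
>
>     # server-wide ceiling on concurrently running subjobs per array
>     qmgr -c 'set server max_slot_limit = 512'
>
>     # optional per-array limit requested at submission time
>     qsub -t 1-2000%256 myjob.sh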
>
> So I want to start some of these jobs. The user has NO jobs currently
> running (there ARE other jobs running; only 5 of those are other array
> jobs, and they belong to a different user).
>
> I am trying with job 20139590[1561]
>
> Here is what I try/get:
>
> [root at cluster ~]# qrls 20139590[1561]
> [root at cluster ~]# qrun 20139590[1561]
> qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and
> there are already 512 jobs running
> 20139590[1561].cluster
> [root at cluster ~]# qrerun 20139590[1561]
> qrerun: Request invalid for state of job MSG=job 20139590[1561].cluster
> is in a bad state 20139590[1561].cluster
>
> I have tried restarting pbs_server and looked at the output of pbsnodes to
> see if any pieces of this job are still floating around, but there are
> not. I also checked each node for anything belonging to that job/user.
> Nothing there either.
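>
> (A rough sketch of how to ask TORQUE directly how many subjobs of this
> array it believes are running, assuming qstat -t expands array subjobs
> here:)
>
>     # count subjobs of the array that qstat reports in state R
>     qstat -t 20139590[] | grep -c ' R '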
>
> Any ideas what is going on here and/or how to get these jobs running?
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> Brian,
>
> I see you are doing a qrls on the job before running the job. So these
> jobs are on hold before they run. Correct?
>
> Regards
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com

Brian,

I am just doing some brainstorming. So it sounds like Moab attempted to run
these jobs, but for whatever reason TORQUE would not allow them to run, and
Moab put a hold on the jobs. Is that correct?
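
If it helps to confirm that, here is a rough sketch of what I would check on
the Moab side (the log path below is only a guess, since that is
site-specific):

    # why Moab last blocked/deferred this subjob
    checkjob -v 20139590[1561]

    # look for the deferral/hold reason in the Moab log
    grep '20139590\[1561\]' /opt/moab/log/moab.log | tail -n 20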



-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com