[torqueusers] Slot limit reached with no jobs running

Ken Nielson knielson at adaptivecomputing.com
Tue Mar 4 11:03:07 MST 2014


Brian,

Have you created a ticket with support for this?


On Tue, Mar 4, 2014 at 10:44 AM, Andrus, Brian Contractor
<bdandrus at nps.edu> wrote:

> This issue has shown up again. Only this time, no jobs from the array are
> currently running on any node.
>
>
>
> Symptoms:
>
> An array job was submitted, but none of its subjobs will start because:
>
> 03/04 09:38:51  ERROR:    job '20153277[1]' cannot be started: (rc: 15004
> errmsg: 'Invalid request MSG=Cannot run job. Array slot limit is 512 and
> there are already 512 jobs running
>
>
>
> Facts:
>
> There are no jobs running by that user.
>
> There are only 66 procs currently in use on the entire cluster.
>
> Moab (7.2.6) and Torque (4.2.6) have both been restarted on the head node.
>
> In torque:
>
> set server max_slot_limit = 512
>
> When I try to force a run:
>
> [root at hamming jobs]# qrun 20153277[1]
>
> qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and
> there are already 512 jobs running
>
> 20153277[1].hamming.hamming.cluster
>
>
>
> Anyone seen this before?
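
One way to make the mismatch above concrete is to compare the count the
server reports in the rejection message with the number of subjobs that are
actually in state R. The following is a minimal sketch, not part of TORQUE:
the function names are hypothetical, and the job list is an assumed input
that the caller builds from whatever job listing is available (the two
entries shown are illustrative, not real output from this cluster).

```python
import re

# Matches the rejection text quoted in this thread. Line wraps and mail
# quote markers ("> ") between "and" and "there" are tolerated.
SLOT_MSG = re.compile(
    r"Array slot limit is (\d+) and[\s>]+there are already (\d+) jobs running"
)


def parse_slot_error(message):
    """Return (limit, claimed_running) from the error text, or None."""
    match = SLOT_MSG.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def count_running_subjobs(array_id, jobs):
    """Count subjobs of array_id whose state is 'R' (running).

    jobs is an iterable of (job_id, state) pairs, an assumed input format.
    Subjob IDs are expected to look like '20153277[1].host'.
    """
    prefix = array_id + "["
    return sum(
        1 for job_id, state in jobs
        if job_id.startswith(prefix) and state == "R"
    )


def stale_slots(message, array_id, jobs):
    """Return claimed minus observed running subjobs, or None if no match."""
    parsed = parse_slot_error(message)
    if parsed is None:
        return None
    _limit, claimed = parsed
    return claimed - count_running_subjobs(array_id, jobs)


# Error text as reported in this thread.
message = (
    "qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and\n"
    "there are already 512 jobs running"
)

# Illustrative job list (hypothetical): both subjobs are queued, none running.
jobs = [
    ("20153277[1].hamming.hamming.cluster", "Q"),
    ("20153277[2].hamming.hamming.cluster", "Q"),
]

print(stale_slots(message, "20153277", jobs))
```

A nonzero result means the server is counting slots that no running subjob
accounts for, which is the symptom described in this report.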
>
>
>
>
>
> Brian Andrus
>
> ITACS/Research Computing
>
> Naval Postgraduate School
>
> Monterey, California
>
> voice: 831-656-6238
>
>
>
>
>
>
>
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer
> *Sent:* Thursday, September 19, 2013 8:55 AM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slot limit unmatched
>
>
>
> Sorry, I misread your first post. How was the user's job submitted? Do you
> have a qstat -f for the job?
>
>
>
> On Thu, Sep 19, 2013 at 12:57 AM, Andrus, Brian Contractor <
> bdandrus at nps.edu> wrote:
>
> David,
>
>
>
> Yes, as I mentioned in the first post:
>
> I have 'set server max_slot_limit = 512'
>
>
>
>
>
>
>
>
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer
> *Sent:* Wednesday, September 18, 2013 4:05 PM
>
>
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slot limit unmatched
>
>
>
> Brian,
>
>
>
> What are your qmgr settings? Do you have a slot limit set there?
>
>
>
> On Wed, Sep 18, 2013 at 3:34 PM, Andrus, Brian Contractor <
> bdandrus at nps.edu> wrote:
>
> That didn't clear it up.
>
>
>
> What I did find is that one of my nodes showed the job id as 20139590[]
>
> (note the missing arrayid)
>
> There were only 4 jobs from the array on that node, along with some other
> jobs. I tagged the node offline, let the jobs drain (although it still
> showed the entire array job), and then ran pbs_mom purge.
>
> After that, I restarted pbs_server and it cleared up.
>
>
>
> Of course, now I cannot run any of the jobs that were blocked because
> "qrun: Execution server rejected request MSG=connection to mom timed out
> 20139590[1561].hamming.hamming.cluster"
>
> It seems that those jobs want to run on that particular node and nowhere
> else, but the node is up and happy. It runs other jobs just fine.
>
>
>
> I do tend to have difficulties with array jobs and torque. Lots of
> idiosyncrasies there.
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *Ken Nielson
> *Sent:* Wednesday, September 18, 2013 9:42 AM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slot limit unmatched
>
>
>
> Brian,
>
> That is a problem. I wonder whether restarting pbs_server clears up the
> slot limit problem. If so, it sounds like we have a counting problem in
> TORQUE.
>
> Regards
>
>
>
> On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor <
> bdandrus at nps.edu> wrote:
>
> All,
>
> I am running torque 4.2.5
> I have a user who submitted an array job of ~2500 jobs
> I have 'set server max_slot_limit = 512'
>
> But...
> There are only 8 of his jobs running, the others are blocked because they
> sat so long.
> Yet if I try to qrun one of them, I get:
>         qrun: Invalid request MSG=Cannot run job. Array slot limit is 512
> and there are already 512 jobs running
>
> Why does torque think there are 512 slots currently in use when there are
> only 8?
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>
>
>
>
>
>
> --
>
> David Beer | Senior Software Engineer
>
> Adaptive Computing
>
>
>
>
>
>
>
> --
>
> David Beer | Senior Software Engineer
>
> Adaptive Computing
>
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com