[torqueusers] Slot limit unmatched

David Beer dbeer at adaptivecomputing.com
Thu Sep 19 09:55:17 MDT 2013


Sorry, I misread your first post. How was the user's job submitted? Do you
have a qstat -f for the job?
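For anyone following along, the state being asked about can be pulled with a few standard TORQUE commands. A sketch (the job ID below is a placeholder for the actual array job):

```shell
# Full attribute dump for the array job and its subjobs
# (the job ID is a placeholder)
qstat -f -t '20139590[]'

# Show the server-wide cap on concurrently running array subjobs
qmgr -c 'list server max_slot_limit'

# A per-array slot limit can also be requested at submission time with
# the %<limit> suffix on -t, e.g. at most 512 of ~2500 tasks at once:
qsub -t 0-2499%512 job_script.sh
```

These are command-line fragments against a live pbs_server, so the output will depend on the cluster's state; the point is to compare the limit qmgr reports with what qstat actually shows running.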


On Thu, Sep 19, 2013 at 12:57 AM, Andrus, Brian Contractor <bdandrus at nps.edu
> wrote:

>  David,
>
> Yes, as I mentioned in the first post:
>
> I have 'set server max_slot_limit = 512'
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer
> *Sent:* Wednesday, September 18, 2013 4:05 PM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slot limit unmatched
>
> Brian,
>
> What are your qmgr settings? Do you have a slot limit set there?
>
> On Wed, Sep 18, 2013 at 3:34 PM, Andrus, Brian Contractor <
> bdandrus at nps.edu> wrote:
>
> That didn't clear it up.
>
> What I did find is that one of my nodes showed the job id as 20139590[]
> (note the missing array id).
>
> There were only 4 jobs from the array on that node, along with some other
> jobs. I marked the node offline, let the jobs drain (although it still
> showed the entire array job), and then ran a pbs_mom purge.
>
> After that, I restarted pbs_server and it cleared up.
>
> Of course, now I cannot run any of the jobs that were blocked, because
> "qrun: Execution server rejected request MSG=connection to mom timed out
> 20139590[1561].hamming.hamming.cluster"
>
> It seems that those jobs want to run on that particular node and nowhere
> else, but the node is up and happy. It runs other jobs just fine.
>
> I do tend to have difficulties with array jobs and torque. Lots of
> idiosyncrasies there.
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *Ken Nielson
> *Sent:* Wednesday, September 18, 2013 9:42 AM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slot limit unmatched
>
> Brian,
>
> That is a problem. I wonder whether the slot limit problem clears up if
> you restart pbs_server. If so, it sounds like we have a counting problem
> in TORQUE.
>
> Regards
>
> On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor <
> bdandrus at nps.edu> wrote:
>
> All,
>
> I am running torque 4.2.5
> I have a user who submitted an array job of ~2500 jobs
> I have 'set server max_slot_limit = 512'
>
> But...
> There are only 8 of his jobs running; the others are blocked because they
> have sat in the queue so long.
> Yet if I try to qrun one of them, I get:
>         qrun: Invalid request MSG=Cannot run job. Array slot limit is 512
> and there are already 512 jobs running
>
> Why does torque think there are 512 slots currently in use when there are
> only 8?
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing