[torqueusers] Slot limit unmatched

Andrus, Brian Contractor bdandrus at nps.edu
Thu Sep 19 00:57:57 MDT 2013


David,

Yes, as I mentioned in the first post:
I have 'set server max_slot_limit = 512'


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, September 18, 2013 4:05 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

Brian,

What are your qmgr settings? Do you have a slot limit set there?

On Wed, Sep 18, 2013 at 3:34 PM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
That didn't clear it up.

What I did find is that one of my nodes showed the job ID as 20139590[]
(note the missing array index).
There were only 4 jobs from the array on that node, along with some other jobs. I marked the node offline, let the jobs drain (although it still showed the entire array job), and then ran pbs_mom purge.
After that, I restarted pbs_server and it cleared up.

Of course, now I cannot run any of the jobs that were blocked; qrun fails with "qrun: Execution server rejected request MSG=connection to mom timed out 20139590[1561].hamming.hamming.cluster"
It seems that those jobs want to run on that particular node and nowhere else, but the node is up and happy. It runs other jobs just fine.

I do tend to have difficulties with array jobs and torque. Lots of idiosyncrasies there.


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
Sent: Wednesday, September 18, 2013 9:42 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

Brian,
That is a problem. I wonder whether the slot limit problem clears up if you restart pbs_server. If so, it sounds like we have a counting problem in TORQUE.
Regards

On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
All,

I am running torque 4.2.5.
I have a user who submitted an array job of ~2500 jobs.
I have 'set server max_slot_limit = 512'.

But...
There are only 8 of his jobs running, the others are blocked because they sat so long.
Yet if I try to qrun one of them, I get:
        qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running

Why does torque think there are 512 slots currently in use when there are only 8?
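
One way to cross-check the count that qrun reports is to tally the array's running sub-jobs independently. The following is a minimal Python sketch, not part of TORQUE: the function name `count_running_subjobs` is hypothetical, and it assumes the default six-column `qstat` listing (Job ID, Name, User, Time Use, S, Queue) with array sub-jobs expanded one per line, which may differ between versions or sites.

```python
import re


def count_running_subjobs(qstat_text, array_id):
    """Count sub-jobs of the array `array_id` whose state column is 'R'.

    Assumes (hypothetically) one job per line in the order:
    Job ID, Name, User, Time Use, S, Queue.
    """
    # Match IDs such as "20139590[1561].host"; the bare parent
    # entry "20139590[]" has no index and is deliberately skipped.
    sub_job = re.compile(r"^" + re.escape(array_id) + r"\[\d+\]")
    running = 0
    for line in qstat_text.splitlines():
        fields = line.split()
        # The state is assumed to be the second-to-last column.
        if len(fields) >= 6 and sub_job.match(fields[0]) and fields[-2] == "R":
            running += 1
    return running
```

The text could come from something like `subprocess.run(["qstat", "-t"], capture_output=True, text=True).stdout` (assuming `-t` expands array jobs on the installed version). If the result is far below the slot limit while qrun still reports the limit as reached, that points to the server's internal counter rather than to jobs that are actually running.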


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



--
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com




--
David Beer | Senior Software Engineer
Adaptive Computing