[torqueusers] Slot limit reached with no jobs running

Andrus, Brian Contractor bdandrus at nps.edu
Tue Mar 4 10:44:01 MST 2014


I am seeing this issue again, only this time none of the array's jobs are running on any node.

Symptoms:
An array job was submitted, but none of its subjobs will start because:
03/04 09:38:51  ERROR:    job '20153277[1]' cannot be started: (rc: 15004  errmsg: 'Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running

Facts:
That user has no jobs running.
There are only 66 procs currently in use on the entire cluster.
Moab (7.2.6) and Torque (4.2.6) have both been restarted on the head node.
In torque:
set server max_slot_limit = 512
When I try to force a run:
[root@hamming jobs]# qrun 20153277[1]
qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running
20153277[1].hamming.hamming.cluster
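
One way to confirm the mismatch is to compare the count in the error message with the number of subjobs the server actually lists as running. The following is a minimal Python sketch, not a Torque tool. The helper names parse_slot_error and count_running_subjobs are hypothetical, the listing is made up rather than captured from the cluster, and the parser assumes a qstat-style layout (for example from qstat -t) in which the job ID is the first column and the single-letter state is the second-to-last column.

```python
import re

# Matches the slot-limit message quoted in this thread.
SLOT_ERROR = re.compile(
    r"Array slot limit is (\d+) and there are already (\d+) jobs running"
)


def parse_slot_error(message):
    """Return (limit, claimed_running) from the error text, or None."""
    match = SLOT_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def count_running_subjobs(qstat_text, array_id):
    """Count subjobs of array_id that are in state R.

    Assumption: the job ID is the first column and the state letter
    is the second-to-last column of each line.
    """
    subjob = re.compile(re.escape(array_id) + r"\[\d+\]")
    running = 0
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not subjob.match(fields[0]):
            continue  # header, separator, or a job from another array
        if fields[-2] == "R":
            running += 1
    return running


error = ("qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 "
         "and there are already 512 jobs running")

# Illustrative listing only; user and job names are invented.
sample = """\
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
20153277[1].hamming       sim-1            someuser               0 Q batch
20153277[2].hamming       sim-2            someuser               0 Q batch
20153300.hamming          other            otheruser       00:10:00 R batch
"""

limit, claimed = parse_slot_error(error)
actual = count_running_subjobs(sample, "20153277")
print(f"limit={limit} claimed={claimed} actual={actual}")
if claimed > actual:
    print("server slot count is higher than the running subjobs listed")
```

If the claimed count stays at the limit while the listed count is zero, that points at the server's internal slot counter rather than at real running subjobs.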

Anyone seen this before?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




From: torqueusers-bounces at supercluster.org On Behalf Of David Beer
Sent: Thursday, September 19, 2013 8:55 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

Sorry, I misread your first post. How was the user's job submitted? Do you have a qstat -f for the job?

On Thu, Sep 19, 2013 at 12:57 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
David,

Yes, as I mentioned in the first post:
I have 'set server max_slot_limit = 512'

From: torqueusers-bounces at supercluster.org On Behalf Of David Beer
Sent: Wednesday, September 18, 2013 4:05 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

Brian,

What are your qmgr settings? Do you have a slot limit set there?

On Wed, Sep 18, 2013 at 3:34 PM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
That didn't clear it up.

What I did find is that one of my nodes showed the job ID as 20139590[]
(note the missing array index).
Only 4 jobs from the array were on that node, along with some other jobs. I marked the node offline, let the jobs drain (although it still showed the entire array job), and then ran pbs_mom purge.
After that, I restarted pbs_server and it cleared up.
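
When scanning a node's job list for entries like this, a quick check for an empty array index can help. The sketch below is illustrative only: classify_job_id is a hypothetical helper, the sample IDs other than 20139590[] are invented, and how the job list is collected from the node is left open. It assumes job IDs of the form sequence[index] with an optional .server suffix.

```python
import re

# <sequence>[<index>] optionally followed by .<server>; the index may be empty.
ARRAY_JOB_ID = re.compile(r"^(\d+)\[(\d*)\](\..*)?$")


def classify_job_id(job_id):
    """Return 'subjob', 'bare-array', or 'regular' for a job ID string."""
    match = ARRAY_JOB_ID.match(job_id)
    if match is None:
        return "regular"
    return "subjob" if match.group(2) else "bare-array"


# Invented sample IDs, apart from the bare array ID reported above.
job_ids = ["20139590[]", "20139590[17].hamming", "20139601.hamming"]
for job_id in job_ids:
    print(job_id, classify_job_id(job_id))
```

Any ID classified as bare-array on a compute node would be the kind of entry described above, since a node would normally be expected to hold individual subjobs.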

Of course, now I cannot run any of the jobs that were blocked because "qrun: Execution server rejected request MSG=connection to mom timed out 20139590[1561].hamming.hamming.cluster"
It seems that those jobs want to run on that particular node and nowhere else, but the node is up and happy. It runs other jobs just fine.

I do tend to have difficulties with array jobs and torque. Lots of idiosyncrasies there.


From: torqueusers-bounces at supercluster.org On Behalf Of Ken Nielson
Sent: Wednesday, September 18, 2013 9:42 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

Brian,
That is a problem. I wonder whether restarting pbs_server clears up the slot limit problem. If so, it sounds like we have a counting problem in TORQUE.
Regards

On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:
All,

I am running torque 4.2.5
I have a user who submitted an array job of ~2500 jobs
I have 'set server max_slot_limit = 512'

But...
Only 8 of his jobs are running; the others are blocked because they sat so long.
Yet if I try to qrun one of them, I get:
        qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running

Why does torque think there are 512 slots currently in use when there are only 8?


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



--
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com

--
David Beer | Senior Software Engineer
Adaptive Computing
