[torqueusers] Slot limit unmatched

Andrus, Brian Contractor bdandrus at nps.edu
Wed Sep 18 15:34:40 MDT 2013

That didn't clear it up.

What I did find is that one of my nodes showed the job id as 20139590[]
(note the missing array id).
There were only 4 jobs from the array on that node, along with some other jobs. I marked the node offline, let the jobs drain (although it still showed the entire array job), and then ran a pbs_mom purge.
After that, I restarted pbs_server and it cleared up.
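
A quick way to spot an entry like this is to check job ids for an empty array index. The following is only a minimal sketch in Python, assuming ids of the form `<sequence>[<index>].<server>` as they appear in this thread; the names `parse_array_job_id` and `has_missing_index` are hypothetical helpers, not part of TORQUE.

```python
import re

# Matches ids such as "20139590[1561].hamming.hamming.cluster" or "20139590[]".
# The pattern is an assumption based only on the ids seen in this thread.
_ARRAY_ID = re.compile(r"^(?P<seq>\d+)\[(?P<index>\d*)\](?:\.(?P<server>.+))?$")


def parse_array_job_id(job_id):
    """Return (sequence, index, server); index is None when the brackets are empty.

    Returns None if the id is not an array-style id at all.
    """
    m = _ARRAY_ID.match(job_id.strip())
    if m is None:
        return None
    index = int(m.group("index")) if m.group("index") else None
    return int(m.group("seq")), index, m.group("server")


def has_missing_index(job_id):
    """True for array-style ids whose brackets are empty, e.g. "20139590[]"."""
    parsed = parse_array_job_id(job_id)
    return parsed is not None and parsed[1] is None
```

Running `has_missing_index` over the job ids a node reports would flag any that carry empty brackets.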

Of course, now I cannot run any of the jobs that were blocked; qrun fails with "qrun: Execution server rejected request MSG=connection to mom timed out 20139590[1561].hamming.hamming.cluster"
It seems that those jobs want to run on that particular node and nowhere else, but the node is up and happy. It runs other jobs just fine.

I do tend to have difficulties with array jobs and torque. Lots of idiosyncrasies there.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
Sent: Wednesday, September 18, 2013 9:42 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit unmatched

That is a problem. I wonder whether the slot limit problem clears up if you restart pbs_server. If so, it sounds like we have a counting problem in TORQUE.

On Wed, Sep 18, 2013 at 9:15 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:

I am running torque 4.2.5
I have a user who submitted an array job of ~2500 jobs
I have 'set server max_slot_limit = 512'

There are only 8 of his jobs running, the others are blocked because they sat so long.
Yet if I try to qrun one of them, I get:
        qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running

Why does torque think there are 512 slots currently in use when there are only 8?
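
One way to cross-check the server's count is to tally the array's subjobs by state from `qstat -t` output. The following is a minimal sketch in Python, assuming the default column layout (job id first, single-letter state second to last) and working on captured text rather than calling qstat itself. The function `count_states` and the sample rows are hypothetical and for illustration only.

```python
from collections import Counter


def count_states(qstat_text, array_seq):
    """Count subjobs of one array by state letter (R, Q, H, ...).

    Assumes each job row starts with an id like "20139590[7].host" and
    that the state is the second-to-last whitespace-separated field.
    """
    prefix = f"{array_seq}["
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue  # header, separator, or a job from another array
        counts[fields[-2]] += 1
    return counts


# Made-up sample in the shape of `qstat -t` output, for illustration only.
sample = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
20139590[1].hamming       sim-1            user1           01:02:03 R batch
20139590[2].hamming       sim-2            user1           00:00:00 R batch
20139590[3].hamming       sim-3            user1                  0 H batch
20139591.hamming          other            user2           00:10:00 R batch
"""

running = count_states(sample, 20139590)["R"]
print(running)
```

If the `R` tally is far below max_slot_limit while qrun still reports the limit as reached, that would point toward the server's internal counter rather than real usage.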

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606