Bug 139 - Negative value in 'Que' when using qstat
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 3.0.x
Platform: PC Linux
Importance: P5 normal
Assigned To: David Beer
 
Reported: 2011-06-22 19:43 MDT by pzimdars
Modified: 2012-05-23 01:49 MDT (History)

Description pzimdars 2011-06-22 19:43:27 MDT
New issue with our recently upgraded Torque server. Here is my 'qstat' output:

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
verylong           --   48:00:00    --      --    0   0 20   D S
hpq                --   24:00:00    --      --    0   0 --   E R
ops                --   24:00:00    --      --    0   0 --   E R
short              --   04:00:00    --      --    0   0 10   E R
long               --   24:00:00    --      --    5  -5 --   E R
amd                --   24:00:00    --      --    2  -1 --   E R
tvac               --   24:00:00    --      --    0   0 --   E R
                                               ----- -----

What I am trying to figure out is why 'qstat' shows negative numbers
in the 'Que' field. Is this some new "feature"? I don't remember this
happening on our previous installation. The only way I can remove the
negative 'Que' values is by restarting pbs_server.

Thanks!
Paul
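
To spot the symptom described above without reading the table by eye, the queue listing can be scanned for negative 'Que' values. The following is only a sketch: the function name negative_queues and the assumption that every row has exactly five whitespace-separated fields before 'Run' are mine, and the sample rows are copied from the output in this description.

```python
import re

# Sample rows copied from the queue listing in the description.
SAMPLE = """\
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
verylong           --   48:00:00    --      --    0   0 20   D S
long               --   24:00:00    --      --    5  -5 --   E R
amd                --   24:00:00    --      --    2  -1 --   E R
"""

# Assumed layout: name, memory, cpu time, walltime, node, then Run and Que.
ROW = re.compile(r"^(\S+)\s+\S+\s+\S+\s+\S+\s+\S+\s+(-?\d+)\s+(-?\d+)\s")


def negative_queues(text):
    """Return {queue_name: que_count} for rows whose Que column is negative."""
    bad = {}
    for line in text.splitlines():
        m = ROW.match(line)  # header, separator, and totals lines do not match
        if m and int(m.group(3)) < 0:
            bad[m.group(1)] = int(m.group(3))
    return bad


print(negative_queues(SAMPLE))
```

In practice the text would come from the real command output rather than a pasted sample; the column layout may differ between versions, so the pattern should be checked against the local output first.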
Comment 1 Nicolas Pinto 2012-01-26 23:36:20 MST
I'm getting the same issue with torque version 3.0.3.

Paul, which version are you using?
Comment 2 Nicolas Pinto 2012-01-26 23:47:46 MST
Quick follow up.

I upgraded to 3.0.4 and "qstat -q" now gives the correct output:

However, the problem persists with "qmgr -c 'list server' | grep state_count":
state_count = Transit:0 Queued:-48 Held:100 Waiting:0 Running:48 Exiting:0
Comment 3 Nicolas Pinto 2012-01-26 23:52:13 MST
After a second look, even "qstat -q" returns the wrong output:

$ qstat -q 
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
kraken.q           --      --       --      --   47  -1 --   E R
                                               ----- -----
                                                  47    -1

$ qmgr -c 'list server' | grep state_count
state_count = Transit:0 Queued:-101 Held:100 Waiting:0 Running:47 Exiting:0
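
The state_count line can also be checked mechanically. This is a minimal sketch with a hypothetical helper name (parse_state_count), using the line quoted in this comment. It also sums Queued, Held, and Waiting; with these figures the sum matches the -1 shown under 'Que' by "qstat -q", but whether qstat actually derives the column that way is an assumption, not something confirmed in this report.

```python
import re


def parse_state_count(line):
    """Parse 'state_count = Transit:0 Queued:-101 ...' into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(-?\d+)", line)}


# Line copied from the qmgr output in this comment.
line = "state_count = Transit:0 Queued:-101 Held:100 Waiting:0 Running:47 Exiting:0"
counts = parse_state_count(line)

# Any negative counter indicates the bug.
negative = {name: value for name, value in counts.items() if value < 0}

# Assumption: the 'Que' column corresponds to jobs that are not running.
not_running = counts["Queued"] + counts["Held"] + counts["Waiting"]

print(negative, not_running)
```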
Comment 4 Victor Gregorio 2012-02-23 11:33:49 MST
Hello, I am also seeing this problem with 2.5.10.  In our case, not only is a
negative number listed in Que, but 2 non-existent jobs are listed in Run.  I am
not sure if these two issues are related, though.

qstat as a privileged user lists zero jobs, but qstat -q shows:

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    2  -4 --   E R
                                               ----- -----
                                                   2    -4

And qmgr -c 'list server' | grep state_count shows:

state_count = Transit:0 Queued:0 Held:-4 Waiting:0 Running:2 Exiting:0 

The only way to clear these erroneous numbers is to restart pbs_server.

Has this issue been resolved in 3.0.4 or 4.0.0?
Comment 5 Nicolas Pinto 2012-02-23 11:55:31 MST
> Has this issue been resolved in 3.0.4 or 4.0.0?

The bug is still present in 3.0.4:

% sudo pbs_server --version
version: 3.0.4

% qmgr -c 'l s' | grep state_count
    state_count = Transit:0 Queued:-22737 Held:22590 Waiting:0 Running:139
Comment 6 Yury V. Zaytsev 2012-03-06 13:08:28 MST
Same problem on my 2.5.9 installation: qterm -t quick and pbs_server -t hot
help, but only for a short time.
Comment 7 shintaro akiyama 2012-05-23 00:08:06 MDT
This bug also occurs in 4.0.2 when I submit array jobs.
Comment 8 Yury V. Zaytsev 2012-05-23 01:49:05 MDT
(In reply to comment #7)
> This bug also occurs in 4.0.2.
> when I submit Array jobs.

Now confirmed on 2.5.11. However, it's not as bad as on 2.5.9, because 2.5.11
periodically detects the inconsistency and logs the following:

05/18/2012 19:49:10;0001;PBS_Server;Svr;PBS_Server;Job state counts incorrect,
server 0: 0 -17 15 0 2 0 ; queue auto 0 (completed: 0): 0 0 0 0 0 0

... and so the negative values don't linger for long, especially if queues are
quite busy. It is still annoying though...
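
For anyone who wants to track how often the server notices the inconsistency, the log message above can be parsed. This is a rough sketch: the helper name server_counts is hypothetical, and the mapping of the six numbers to Transit, Queued, Held, Waiting, Running, Exiting is an assumption based on the order used in state_count (it is at least consistent with the leading "server 0", since the six values sum to zero).

```python
import re

# Assumed order, borrowed from the state_count attribute.
STATES = ["Transit", "Queued", "Held", "Waiting", "Running", "Exiting"]

# Log line from this comment, rejoined onto one line.
LOG = ("05/18/2012 19:49:10;0001;PBS_Server;Svr;PBS_Server;"
       "Job state counts incorrect, "
       "server 0: 0 -17 15 0 2 0 ; queue auto 0 (completed: 0): 0 0 0 0 0 0")

PATTERN = re.compile(
    r"Job state counts incorrect,\s*server\s+(-?\d+):\s*((?:-?\d+\s+){6})"
)


def server_counts(log_line):
    """Return (total, {state: count}) for the server part, or None if no match."""
    m = PATTERN.search(log_line)
    if m is None:
        return None
    values = [int(v) for v in m.group(2).split()]
    return int(m.group(1)), dict(zip(STATES, values))


total, per_state = server_counts(LOG)
print(total, per_state)
```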

P.S. Cheers to Riken ;-)