[torqueusers] Bug in Torque 1.2.0p6 ?

Jacques Foury Jacques.Foury at math.u-bordeaux1.fr
Tue Jan 24 10:13:04 MST 2006


Hi.

We're running a 6-node cluster of dual-Opteron machines.

One of our nodes is currently running 3 jobs instead of 2, and qstat gives
a strange result:

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7447.ulmo.calcu bouchere infiniCa twi1D2      31413   1  -- 1000mb 2000: R 153:5
   callas04/0
7993.ulmo.math. khodor   q1jourCa microf        --   --  -- 1500mb 02:00 Q   --
   callas04/1
7994.ulmo.math. khodor   q1jourCa nsmgev      30742  --  -- 1500mb 02:00 R 00:52
   callas04/1

Job 7993 is marked as queued (Q), yet it has a processor reserved: callas04/1,
the same processor as job 7994!

It is nevertheless actually running on the node, as ps auxf shows:

root     16336  0.0  0.0 34520 3944 ?        Ss    2005  31:55 /usr/sbin/pbs_mom -r
bouchere 31413  0.0  0.1 30984 5440 ?        Ss   Jan18   0:00  \_ -bash
bouchere 31445  0.0  0.1 30988 5448 ?        S    Jan18   0:00  |   \_ -bash
bouchere 31551  0.0  0.0  2336  292 ?        S    Jan18   0:00  |       \_ time /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
bouchere 31552 97.8 17.3 713016 704240 ?     R    Jan18 9213:01  |           \_ /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
khodor   30527  0.0  0.1 30988 5444 ?        Ss   16:35   0:00  \_ -bash
khodor   30559  0.0  0.1 30992 5452 ?        S    16:35   0:00  |   \_ -bash
khodor    2356  0.0  0.0  2336  292 ?        S    17:14   0:00  |       \_ time ./microf
khodor    2357 16.1  0.0 12336 3756 ?        R    17:14   7:46  |           \_ ./microf
khodor   30742  0.0  0.1 30988 5444 ?        Ss   16:36   0:00  \_ -bash
khodor   30774  0.0  0.1 30992 5452 ?        S    16:36   0:00      \_ -bash
khodor   31678  0.0  0.0  2336  292 ?        S    16:40   0:00          \_ time ./nsmgev
khodor   31679 38.0  0.9 46076 37432 ?       R    16:40  30:56              \_ ./nsmgev

The nodes file says:

callas01 np=2 opteron callas
callas02 np=2 opteron callas
callas03 np=2 opteron callas
callas04 np=2 opteron callas
callas05 np=2 opteron callas
callas06 np=2 opteron callas

and pbsnodes -a reports only 2 jobs on the node:

# pbsnodes -a callas04
callas04
     state = job-exclusive
     np = 2
     properties = opteron,callas
     ntype = cluster
     jobs = 0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr
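
The mismatch between the two outputs can also be checked mechanically. The
following is only a rough Python sketch, not a Torque tool: it works on text
pasted from the qstat and pbsnodes outputs above, the function names are made
up for illustration, and the assumption that several slots on one indented
qstat line would be separated by "+" is mine.

```python
import re
from collections import defaultdict

# Sample text copied from the qstat output above (long lines unwrapped).
QSTAT_TEXT = """\
7447.ulmo.calcu bouchere infiniCa twi1D2      31413   1  -- 1000mb 2000: R 153:5
   callas04/0
7993.ulmo.math. khodor   q1jourCa microf        --   --  -- 1500mb 02:00 Q   --
   callas04/1
7994.ulmo.math. khodor   q1jourCa nsmgev      30742  --  -- 1500mb 02:00 R 00:52
   callas04/1
"""

# Value of the "jobs =" line reported by pbsnodes for callas04.
PBSNODES_JOBS = "0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr"


def slots_from_qstat(text):
    """Map each 'node/slot' to the numeric job IDs listed on it."""
    slots = defaultdict(list)
    job_id = None
    for line in text.splitlines():
        if re.match(r"\d+\.", line):       # job line starts with "<number>."
            job_id = line.split(".", 1)[0]
        elif job_id and line.strip():      # indented line: slots of that job
            # Assumption: several slots would be joined with "+".
            for slot in line.strip().split("+"):
                slots[slot].append(job_id)
    return dict(slots)


def jobs_from_pbsnodes(jobs_field):
    """Return the set of numeric job IDs in a pbsnodes 'jobs =' value."""
    return {entry.split("/", 1)[1].split(".", 1)[0]
            for entry in jobs_field.split(",") if entry.strip()}


slots = slots_from_qstat(QSTAT_TEXT)
# Slots that qstat assigns to more than one job.
shared = {slot: jobs for slot, jobs in slots.items() if len(jobs) > 1}
# Jobs that qstat places on the node but pbsnodes does not list.
known = jobs_from_pbsnodes(PBSNODES_JOBS)
missing = sorted({j for jobs in slots.values() for j in jobs} - known)

print("slots claimed by several jobs:", shared)
print("jobs in qstat but not in pbsnodes:", missing)
```

On the data above this flags callas04/1 as claimed by both 7993 and 7994, and
7993 as the job that pbsnodes does not know about.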


What could be wrong?

-- 
Jacques Foury



