[torqueusers] Bug in Torque 1.2.0p6 ?
Jacques Foury
Jacques.Foury at math.u-bordeaux1.fr
Tue Jan 24 10:13:04 MST 2006
Hi.
We're running a 6-node cluster of dual-Opteron machines.
One of our nodes is currently running 3 jobs instead of 2, and qstat shows a
strange result:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
7447.ulmo.calcu bouchere infiniCa twi1D2      31413   1  -- 1000mb 2000: R 153:5
   callas04/0
7993.ulmo.math. khodor   q1jourCa microf         --  --  -- 1500mb 02:00 Q   --
   callas04/1
7994.ulmo.math. khodor   q1jourCa nsmgev      30742  --  -- 1500mb 02:00 R 00:52
   callas04/1
Job 7993 is marked as QUEUED, yet it has a processor reserved: the same
processor as 7994 (callas04/1)!
And it is actually RUNNING on the node:
ps auxf:
root     16336  0.0  0.0  34520   3944 ? Ss  2005    31:55 /usr/sbin/pbs_mom -r
bouchere 31413  0.0  0.1  30984   5440 ? Ss  Jan18    0:00  \_ -bash
bouchere 31445  0.0  0.1  30988   5448 ? S   Jan18    0:00  |   \_ -bash
bouchere 31551  0.0  0.0   2336    292 ? S   Jan18    0:00  |       \_ time /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
bouchere 31552 97.8 17.3 713016 704240 ? R   Jan18 9213:01  |           \_ /home/mab/bouchere/THESE/1D/2NIV/TWI_FIN/POLLEN/TWI/run
khodor   30527  0.0  0.1  30988   5444 ? Ss  16:35    0:00  \_ -bash
khodor   30559  0.0  0.1  30992   5452 ? S   16:35    0:00  |   \_ -bash
khodor    2356  0.0  0.0   2336    292 ? S   17:14    0:00  |       \_ time ./microf
khodor    2357 16.1  0.0  12336   3756 ? R   17:14    7:46  |           \_ ./microf
khodor   30742  0.0  0.1  30988   5444 ? Ss  16:36    0:00  \_ -bash
khodor   30774  0.0  0.1  30992   5452 ? S   16:36    0:00      \_ -bash
khodor   31678  0.0  0.0   2336    292 ? S   16:40    0:00          \_ time ./nsmgev
khodor   31679 38.0  0.9  46076  37432 ? R   16:40   30:56              \_ ./nsmgev
The nodes file says:
callas01 np=2 opteron callas
callas02 np=2 opteron callas
callas03 np=2 opteron callas
callas04 np=2 opteron callas
callas05 np=2 opteron callas
callas06 np=2 opteron callas
and pbsnodes -a reports only 2 jobs on the node:
# pbsnodes -a callas04
callas04
     state = job-exclusive
     np = 2
     properties = opteron,callas
     ntype = cluster
     jobs = 0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr
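
To make the mismatch explicit, here is a minimal Python sketch that cross-checks the two views. It is not part of Torque: the function names parse_pbsnodes_jobs and shared_slots are made up for illustration, and the input is copied by hand from the qstat and pbsnodes output shown in this message rather than read from the commands themselves.

```python
from collections import defaultdict


def parse_pbsnodes_jobs(line):
    """Turn a 'jobs = 0/7447..., 1/7994...' line into {slot_number: job_id}."""
    _, _, value = line.partition("=")
    slots = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        slot, _, job_id = entry.partition("/")
        slots[int(slot)] = job_id
    return slots


def shared_slots(qstat_slots):
    """Return the 'node/slot' entries that more than one job claims."""
    claims = defaultdict(list)
    for job, slot in qstat_slots:
        claims[slot].append(job)
    return {slot: jobs for slot, jobs in claims.items() if len(jobs) > 1}


# Hand-copied from the pbsnodes output (assumption: one 'jobs =' line per node).
pbsnodes_line = "jobs = 0/7447.ulmo.calcul, 1/7994.ulmo.math.u-bordeaux1.fr"

# Hand-copied from the qstat output: (job number, slot listed under the job).
qstat_slots = [
    ("7447", "callas04/0"),
    ("7993", "callas04/1"),
    ("7994", "callas04/1"),
]

node_jobs = parse_pbsnodes_jobs(pbsnodes_line)
conflicts = shared_slots(qstat_slots)

# Jobs that qstat places on the node but that pbsnodes does not list.
known = {job_id.split(".")[0] for job_id in node_jobs.values()}
missing = [job for job, _ in qstat_slots if job not in known]

print(conflicts)  # {'callas04/1': ['7993', '7994']}
print(missing)    # ['7993']
```

On the data above this flags slot callas04/1 as claimed by both 7993 and 7994, and 7993 as a job that qstat places on the node while pbsnodes does not list it.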
What could be wrong?
--
Jacques Foury