[torqueusers] jobs not being scheduled but many free slots
Arnau Bria
arnaubria at pic.es
Sun Jan 4 07:30:06 MST 2009
On Sat, 03 Jan 2009 18:29:06 +0000
Craig Macdonald wrote:
> I think the problem is that while the node is free, the loadavg on
> the node suggests otherwise:
>
> pbsnodes reports
> loadave=1.64
>
> maui reports
> Load: 3.170
Well, I don't think it is a matter of load.
This is the first job in the queue:
[root@pbs02 ~]# checkjob 1668114
checking job 1668114
State: Idle
Creds: user:dteam001 group:dteam class:short qos:DEFAULT
WallTime: 00:00:00 of 3:00:00
SubmitTime: Sun Jan 4 15:25:35
(Time Queued Total: 00:00:10 Eligible: 00:00:10)
StartDate: 00:00:01 Sun Jan 4 15:25:46
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc4]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '1668114' (00:00:01 -> 3:00:01 Duration: 3:00:00)
Messages: cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=td057.pic.es MSG=cannot allocate node 'td057.pic.es' to job - node not currently available (nps needed/free: 1/0, joblist: 1650156.pbs02.pic.es:0,1668063.pbs02.pic.es:1,1667245.pbs02.pic.es:2,1668067.pbs02.pic.es:3,1667246.pbs02.pic.es:4,1667247.pbs02.pic.es:5,1668073.pbs02.pic.es:6,1668084.pbs02.pic.es:7)'
PE: 1.00 StartPriority: 10000
cannot select job 1668114 for partition DEFAULT (startdate in '00:00:01')
which complains about td057:
[root@pbs02 ~]# diagnose -n td057.pic.es
diagnosing node table (5120 slots)
Name          State    Procs  Memory       Disk          Swap         Speed  Opsys  Arch    Par  Load  Res  Classes                         Network    Features
td057.pic.es  Running  4:8    16242:16242  91953:105739  29147:29147  1.00   linux  [NONE]  DEF  8.00  005  [long_8:8][medium_8:8][short_4  [DEFAULT]  [slc4][magic]
-----         ---      4:8    16242:16242  91953:105739  29147:29147
Total Nodes: 1 (Active: 1 Idle: 0 Down: 0)
So Maui seems to see free slots on the node (Procs 4:8), but Torque reports all 8 in use:
[root@pbs02 ~]# pbsnodes td057.pic.es
td057.pic.es
state = job-exclusive
np = 8
properties = slc4,magic
ntype = cluster
jobs = 0/1650156.pbs02.pic.es, 1/1668063.pbs02.pic.es, 2/1667245.pbs02.pic.es, 3/1668067.pbs02.pic.es, 4/1667246.pbs02.pic.es, 5/1667247.pbs02.pic.es, 6/1668073.pbs02.pic.es, 7/1668084.pbs02.pic.es
status = opsys=linux,uname=Linux td057.pic.es 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=11935 21481 21488 21551 27762 27792 27822 27852,nsessions=8,nusers=3,idletime=1122373,totmem=32637840kb,availmem=29843460kb,physmem=16632008kb,ncpus=8,loadave=4.00,gres=cpu_factor:=1.52375,netload=1872480034,size=94159892kb:108277440kb,state=free,jobs=1650156.pbs02.pic.es 1667246.pbs02.pic.es 1667247.pbs02.pic.es 1667245.pbs02.pic.es 1668063.pbs02.pic.es 1668067.pbs02.pic.es 1668073.pbs02.pic.es 1668084.pbs02.pic.es,varattr=,rectime=1231079189
[root@pbs02 ~]# diagnose -n td057.pic.es
diagnosing node table (5120 slots)
Name          State    Procs  Memory       Disk          Swap         Speed  Opsys  Arch    Par  Load  Res  Classes                         Network    Features
td057.pic.es  Running  4:8    16242:16242  91953:105739  29147:29147  1.00   linux  [NONE]  DEF  8.00  005  [long_8:8][medium_8:8][short_4  [DEFAULT]  [slc4][magic]
-----         ---      4:8    16242:16242  91953:105739  29147:29147
Total Nodes: 1 (Active: 1 Idle: 0 Down: 0)
So it seems that Torque and Maui are not seeing the same info...
I see no connection errors between the two services... really strange.
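
To make the mismatch easier to spot across many nodes, the comparison could be scripted. What follows is only a rough sketch, not a tested tool: it parses text pasted from the pbsnodes and diagnose -n output above. It assumes that the Procs column of diagnose -n is "available:configured" and that each entry on the pbsnodes "jobs =" line occupies one slot. The function names (torque_slots, maui_slots) are made up for this illustration.

```python
import re

# Sample text taken from the output quoted in this message.
PBSNODES_SAMPLE = """td057.pic.es
state = job-exclusive
np = 8
properties = slc4,magic
ntype = cluster
jobs = 0/1650156.pbs02.pic.es, 1/1668063.pbs02.pic.es, 2/1667245.pbs02.pic.es, 3/1668067.pbs02.pic.es, 4/1667246.pbs02.pic.es, 5/1667247.pbs02.pic.es, 6/1668073.pbs02.pic.es, 7/1668084.pbs02.pic.es
"""

DIAGNOSE_ROW = (
    "td057.pic.es Running 4:8 16242:16242 91953:105739 29147:29147 "
    "1.00 linux [NONE] DEF 8.00 005"
)


def torque_slots(pbsnodes_text):
    """Return (used, total) slots as Torque sees them.

    Only lines that *start* with 'np' or 'jobs' are matched, so the
    'ncpus=' and 'jobs=' fields inside the 'status =' line are ignored.
    """
    np_match = re.search(r"^\s*np\s*=\s*(\d+)", pbsnodes_text, re.MULTILINE)
    if np_match is None:
        raise ValueError("no 'np =' line found")
    total = int(np_match.group(1))

    jobs_match = re.search(r"^\s*jobs\s*=\s*(.+)$", pbsnodes_text, re.MULTILINE)
    if jobs_match is None:
        return 0, total  # no jobs line means no slots in use
    entries = [e for e in jobs_match.group(1).split(",") if e.strip()]
    return len(entries), total


def maui_slots(diagnose_row):
    """Return (free, total) from the Procs column of a diagnose -n row.

    Assumption: Procs is the third whitespace-separated field and reads
    'available:configured'.
    """
    free, total = diagnose_row.split()[2].split(":")
    return int(free), int(total)


used, total = torque_slots(PBSNODES_SAMPLE)
maui_free, maui_total = maui_slots(DIAGNOSE_ROW)
torque_free = total - used

print(f"Torque free slots: {torque_free}/{total}")
print(f"Maui free slots:   {maui_free}/{maui_total}")
if torque_free != maui_free:
    print("MISMATCH: Maui may try to place jobs Torque will reject")
```

With the td057 data above, Torque has no free slots while Maui believes 4 are available, which matches the "nps needed/free: 1/0" rejection in the checkjob message.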
Cheers,
Arnau