[torqueusers] jobs not being scheduled but many free slots

Arnau Bria arnaubria at pic.es
Sun Jan 4 07:30:06 MST 2009


On Sat, 03 Jan 2009 18:29:06 +0000
Craig Macdonald wrote:

> I think the problem is that while the node is free, the loadavg on
> the node suggests otherwise:
> 
> pbsnodes reports
> 	loadave=1.64
> 
> maui reports
>       Load: 3.170
Well, I don't think it is a matter of load.
This is the first job in the queue:


[root@pbs02 ~]# checkjob 1668114


checking job 1668114

State: Idle
Creds:  user:dteam001  group:dteam  class:short  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00
SubmitTime: Sun Jan  4 15:25:35
  (Time Queued  Total: 00:00:10  Eligible: 00:00:10)

StartDate: 00:00:01  Sun Jan  4 15:25:46
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '1668114' (00:00:01 -> 3:00:01  Duration: 3:00:00)
Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=td057.pic.es MSG=cannot allocate node 'td057.pic.es' to job - node not currently available (nps needed/free: 1/0,  joblist: 1650156.pbs02.pic.es:0,1668063.pbs02.pic.es:1,1667245.pbs02.pic.es:2,1668067.pbs02.pic.es:3,1667246.pbs02.pic.es:4,1667247.pbs02.pic.es:5,1668073.pbs02.pic.es:6,1668084.pbs02.pic.es:7)'
PE:  1.00  StartPriority:  10000
cannot select job 1668114 for partition DEFAULT (startdate in '00:00:01')


which complains about td057:

[root@pbs02 ~]# diagnose -n td057.pic.es
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features              

td057.pic.es          Running   4:8    16242:16242   91953:105739  29147:29147   1.00  linux [NONE] DEF   8.00 005 [long_8:8][medium_8:8][short_4 [DEFAULT]                      [slc4][magic]       
-----                     ---   4:8    16242:16242   91953:105739  29147:29147 

Total Nodes: 1  (Active: 1  Idle: 0  Down: 0)


So Maui seems to see free slots on the node (Procs 4:8), but Torque does not:


[root@pbs02 ~]# pbsnodes td057.pic.es
td057.pic.es
     state = job-exclusive
     np = 8
     properties = slc4,magic
     ntype = cluster
     jobs = 0/1650156.pbs02.pic.es, 1/1668063.pbs02.pic.es, 2/1667245.pbs02.pic.es, 3/1668067.pbs02.pic.es, 4/1667246.pbs02.pic.es, 5/1667247.pbs02.pic.es, 6/1668073.pbs02.pic.es, 7/1668084.pbs02.pic.es
     status = opsys=linux,uname=Linux td057.pic.es 2.6.9-42.0.3.ELsmp #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=11935 21481 21488 21551 27762 27792 27822 27852,nsessions=8,nusers=3,idletime=1122373,totmem=32637840kb,availmem=29843460kb,physmem=16632008kb,ncpus=8,loadave=4.00,gres=cpu_factor:=1.52375,netload=1872480034,size=94159892kb:108277440kb,state=free,jobs=1650156.pbs02.pic.es 1667246.pbs02.pic.es 1667247.pbs02.pic.es 1667245.pbs02.pic.es 1668063.pbs02.pic.es 1668067.pbs02.pic.es 1668073.pbs02.pic.es 1668084.pbs02.pic.es,varattr=,rectime=1231079189
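
To make the Torque side explicit, here is a minimal sketch that counts the occupied slots against np. It assumes the text layout of the pbsnodes output shown above; the helper name slots_in_use is hypothetical and not part of Torque.

```python
import re

def slots_in_use(pbsnodes_text):
    """Return (np, used_slots) parsed from the text output of `pbsnodes <node>`."""
    np_match = re.search(r"^\s*np = (\d+)\s*$", pbsnodes_text, re.MULTILINE)
    jobs_match = re.search(r"^\s*jobs = (.+)$", pbsnodes_text, re.MULTILINE)
    np_total = int(np_match.group(1)) if np_match else 0
    # Each entry looks like "<slot>/<jobid>"; an idle node has no "jobs = " line.
    entries = jobs_match.group(1).split(",") if jobs_match else []
    used = len({e.strip().split("/", 1)[0] for e in entries if e.strip()})
    return np_total, used

sample = """td057.pic.es
     state = job-exclusive
     np = 8
     jobs = 0/1650156.pbs02.pic.es, 1/1668063.pbs02.pic.es, 2/1667245.pbs02.pic.es, 3/1668067.pbs02.pic.es, 4/1667246.pbs02.pic.es, 5/1667247.pbs02.pic.es, 6/1668073.pbs02.pic.es, 7/1668084.pbs02.pic.es
"""

print(slots_in_use(sample))  # (8, 8): every slot is taken
```

With the output above this gives 8 of 8 slots in use, which matches state = job-exclusive.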

[root@pbs02 ~]# diagnose -n td057.pic.es
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features              

td057.pic.es          Running   4:8    16242:16242   91953:105739  29147:29147   1.00  linux [NONE] DEF   8.00 005 [long_8:8][medium_8:8][short_4 [DEFAULT]                      [slc4][magic]       
-----                     ---   4:8    16242:16242   91953:105739  29147:29147 

Total Nodes: 1  (Active: 1  Idle: 0  Down: 0)



So it seems that Torque and Maui are not seeing the same information...
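
The mismatch can be checked mechanically. This is only a sketch: it assumes the Procs column of diagnose -n reads available:configured, and the function names are hypothetical. The Torque figures (np = 8, 8 slots in use) are taken from the pbsnodes output.

```python
def maui_procs(node_row):
    """Parse the Procs column ("avail:configured") from a `diagnose -n` node row."""
    fields = node_row.split()
    # Assumed column order: Name, State, Procs, Memory, ...
    avail, configured = fields[2].split(":")
    return int(avail), int(configured)

def views_disagree(maui_row, torque_np, torque_used):
    """True if Maui's available processors differ from Torque's free slots."""
    maui_avail, _ = maui_procs(maui_row)
    return maui_avail != torque_np - torque_used

row = ("td057.pic.es          Running   4:8    16242:16242   "
       "91953:105739  29147:29147   1.00  linux [NONE] DEF   8.00 005")

print(maui_procs(row))            # (4, 8)
print(views_disagree(row, 8, 8))  # True: Maui sees 4 free, Torque sees 0
```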

I see no connection errors between the two services... really strange.

Cheers,
Arnau



