[Mauiusers] maui not scheduling when no resources available

Arnau Bria arnaubria at pic.es
Thu Dec 13 08:05:06 MST 2007


On Thu, 13 Dec 2007 13:59:59 +0100
Jan Ploski wrote:

> mauiusers-bounces at supercluster.org wrote on 12/13/2007 01:05:06 PM:
> 
> ... 
> > Something similar happened when requesting hosts with "slc3 &&
> > slc4": no nodes fit that condition and maui hung...
> > 
> > So, is it a bug? Is anyone having the same problem? Any workaround? 
> 
> I once had the same kind of problem - a job stuck at the front of the 
> queue preventing other jobs from executing even though checkjob
> reported "can run" for them. In my case, it was due to an
> inconsistency between Maui's and TORQUE's view of the available
> resources: Maui thought the jobs running on a node each used 0
> processors, so it kept trying to assign a job to an already occupied
> resource, and TORQUE was rejecting these attempts.
> 
> Maybe the output of diagnose -n <name of the offline node>, diagnose
> -j <job id>, diagnose -r will provide additional clues?

# diagnose -n td248.pic.es
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features              

td248.pic.es          Drained   0:10    8006:8006        1:1        4535:4535    1.00  linux [NONE] DEF   0.00 000 [long_10:10][medium_10:10][sho [DEFAULT]                      [slc3]              
-----                     ---   0:10    8006:8006        1:1        4535:4535  

Total Nodes: 1  (Active: 0  Idle: 0  Down: 1)


# diagnose -j 3446272
Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features

3446272                Idle ALL    1 DEF  3:00:00:00 0    1 arnaubri     grid        -    00:08:27   [NONE] [NONE] [NONE]    >=0    >=0    NC0    [slc3:1] [slc3]


# diagnose -r
Diagnosing Reservations
ResID                      Type Par   StartTime     EndTime     Duration Node Task Proc
-----                      ---- ---   ---------     -------     -------- ---- ---- ----
3435058                     Job DEF    -8:41:47    15:18:13   1:00:00:00    1    1    1
    ACL: JOB==3435058= 
    CL:  JOB==3435058 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1 
3435059                     Job DEF    -8:41:35    15:18:25   1:00:00:00    1    1    1
    ACL: JOB==3435059= 
    CL:  JOB==3435059 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1 
3435060                     Job DEF    -8:40:53    15:19:07   1:00:00:00    1    1    1
    ACL: JOB==3435060= 
    CL:  JOB==3435060 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1 
3435062                     Job DEF    -8:41:26    15:18:34   1:00:00:00    1    1    1

[...]

Active Reserved Processors: 218

What exactly are you looking for with "-r"?




> Another thing that you might try is setting the node 'down' (kill
> pbs_mom on it) rather than 'offline' to see if it changes anything.

[root@td248 root]# /etc/init.d/pbs_mom stop
Shutting down TORQUE Mom:                                  [  OK  ]


# pbsnodes td248.pic.es
td248.pic.es
     state = down,offline
     np = 10
     properties = slc3
     ntype = cluster


Not sure if killing pbs_mom did the trick, but now maui is scheduling
fine, even if I set the worker node to offline only...

# !pb
pbsnodes td248.pic.es
td248.pic.es
     state = offline
     np = 10
     properties = slc3
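
For anyone who wants to check the node state from a script rather than
by eye, here is a minimal Python sketch. It only parses text in the
layout that pbsnodes printed above (an unindented node name followed by
indented "key = value" lines); the function names parse_pbsnodes and
node_states are made up for illustration, and the sample text is copied
from the output earlier in this message rather than read from a live
server.

```python
def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Indented "key = value" lines belong to the current node.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def node_states(attrs):
    """Split the comma-separated state field into a set of flags."""
    return set(attrs.get("state", "").split(",")) - {""}


# Sample copied from the pbsnodes output shown in this message.
SAMPLE = """\
td248.pic.es
     state = down,offline
     np = 10
     properties = slc3
     ntype = cluster
"""

nodes = parse_pbsnodes(SAMPLE)
for name, attrs in nodes.items():
    states = node_states(attrs)
    # Treat a node as unavailable if it is flagged down or offline.
    unavailable = bool(states & {"down", "offline"})
    print(name, sorted(states), "unavailable" if unavailable else "available")
```

To use it against a real cluster, the SAMPLE string would be replaced
by the captured output of pbsnodes. It says nothing about what Maui
itself believes about the node, so it can only cover the TORQUE side of
the comparison.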


> Best regards,
> Jan Ploski
Cheers,
Arnau

