[Mauiusers] maui not scheduling on all nodes

Mario Kadastik mario.kadastik at cern.ch
Wed Oct 31 12:28:47 MDT 2012


Hi,

I'm having trouble with the maui from the EMI-1 repository. It schedules jobs only up to a certain count and then stops scheduling more, even though there are free slots. The maui log shows that it tries to schedule jobs but fails to create reservations:

10/31 19:49:45 INFO:     162 PBS resources detected on RM base
10/31 19:49:45 INFO:     resources detected: 162
10/31 19:49:45 MPBSWorkloadQuery(base,JCount,SC)
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046246' loaded:   1   cms225      cms 259200       Idle   0 1351705749   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046247' loaded:   1   cms225      cms 259200       Idle   0 1351705750   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046248' loaded:   1   cms225      cms 259200       Idle   0 1351705752   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046249' loaded:   1   cms225      cms 259200       Idle   0 1351705756   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046250' loaded:   1   cms225      cms 259200       Idle   0 1351705770   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     active PBS job 1041018 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1041187 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044863 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044890 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044916 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1045212 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     4982 PBS jobs detected on RM base
10/31 19:50:06 INFO:     jobs detected: 4982
10/31 19:50:07 INFO:     total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO:     total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO:     total jobs selected in partition ALL: 848/848 
10/31 19:50:07 INFO:     total jobs selected in partition ALL: 848/848 
10/31 19:50:07 INFO:     total jobs selected in partition DEFAULT: 848/848 
10/31 19:50:07 MRMJobStart(1045241,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045241,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,wn-v-4196.local)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,1)
10/31 19:50:07 INFO:     job '1045241' successfully started
10/31 19:50:07 MRMJobStart(1045242,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045242,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,wn-v-6068.local)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,1)
10/31 19:50:07 INFO:     job '1045242' successfully started
10/31 19:50:07 ERROR:    cannot create reservation for job '1045242'
10/31 19:50:07 ERROR:    cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045244,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045245,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045247,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045246,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
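
One way to see the pattern in the excerpt above is to tally the relevant messages. The following is only a minimal sketch in Python: the function name `summarize` and the regular expressions are illustrative assumptions written against the lines shown here, not part of maui, and a real maui.log may contain other message variants.

```python
import re

# A few lines copied verbatim from the log excerpt above.
LOG = """\
10/31 19:50:07 INFO:     job '1045241' successfully started
10/31 19:50:07 INFO:     job '1045242' successfully started
10/31 19:50:07 ERROR:    cannot create reservation for job '1045242'
10/31 19:50:07 ERROR:    cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
"""

# Patterns are assumptions based only on the message text seen in this excerpt.
STARTED = re.compile(r"job '(\d+)' successfully started")
RES_FAILED = re.compile(r"cannot create reservation for job '(\d+)'")
ALERT_TEXT = "cannot create reservation in MJobReserve"


def summarize(text):
    """Collect started jobs, reservation failures, and MJobReserve alerts."""
    started, failed = set(), set()
    alerts = 0
    for line in text.splitlines():
        match = STARTED.search(line)
        if match:
            started.add(match.group(1))
        match = RES_FAILED.search(line)
        if match:
            failed.add(match.group(1))
        if ALERT_TEXT in line:
            alerts += 1
    return {
        "started": sorted(started),
        "reservation_failed": sorted(failed),
        # Jobs reported as started and then hit by a reservation error.
        "started_then_failed": sorted(started & failed),
        "mjobreserve_alerts": alerts,
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(key, value)
```

Run against the full log, a summary like this would show how many jobs start per iteration before the reservation errors begin, and whether jobs such as 1045242 are consistently reported as started and then as failed.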

The queues show this:
[root@torque-v-1 log]# qstat -q

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --    0   0 --   E R
long               --   48:00:00 72:00:00   --  4101 974 --   E R
short              --   01:00:00 02:00:00   --    2   0 --   E R
                                              ----- -----
                                               4103   974
[root@torque-v-1 log]# 

There are free slots, however:
[root@torque-v-1 log]# diagnose -t
    DEFAULT [test 5427:5427]
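
The arithmetic behind "there are free slots" can be sketched as follows. This is a rough Python illustration, assuming that the 5427 in the diagnose -t line is the total slot count (the output itself does not label it) and that the qstat -q columns are whitespace-separated as pasted above. The helper name `queue_totals` is hypothetical.

```python
# Queue table copied from the qstat -q output above.
QSTAT = """\
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --    0   0 --   E R
long               --   48:00:00 72:00:00   --  4101 974 --   E R
short              --   01:00:00 02:00:00   --    2   0 --   E R
"""

# Assumption: the 5427 from "DEFAULT [test 5427:5427]" is the total slot count.
TOTAL_SLOTS = 5427


def queue_totals(text):
    """Sum the Run and Que columns over all queue rows."""
    running = queued = 0
    for line in text.splitlines()[2:]:  # skip the header and the dashed rule
        fields = line.split()
        if len(fields) < 7:
            continue
        # Fields: name, memory, cpu time, walltime, node, run, que, ...
        running += int(fields[5])
        queued += int(fields[6])
    return running, queued


running, queued = queue_totals(QSTAT)
free = TOTAL_SLOTS - running

if __name__ == "__main__":
    print("running:", running)
    print("queued:", queued)
    print("free slots (under the assumption above):", free)
```

Under that assumption the queued jobs number fewer than the unused slots, so slot count alone would not explain why they stay idle.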

All slots are configured for the short and long queues (why they don't show up in diagnose -t is beyond me, but ...). Ideas are welcome. I've seen scheduling get stuck at around 3500-3700 running jobs; now, after a maintenance downtime during which the job count reached 0, the limit seems to be around 4100-4300 jobs. I saw 4930 running jobs a while ago, but that hasn't been possible recently.

The installed maui packages are:
[root@torque-v-1 log]# rpm -qa|grep maui
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5

PS: If you received this twice, sorry. I wasn't sure my original mail got through.

Thanks in advance, 

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why we do it" 
     -- Richard P. Feynman


