[Mauiusers] Maui is not submitting jobs to torque

Alessandra Forti Alessandra.Forti at cern.ch
Wed Oct 10 13:30:49 MDT 2012


Hi,

I have installed a mini test cluster with torque and maui. We have used 
maui/torque for years on our grid cluster and are now upgrading to 
torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination 
maui doesn't seem to work correctly: when I submit jobs, it behaves 
as if there weren't any free resources. Even when I installed only 
torque and maui with a bare minimum configuration, I got the same 
behaviour, i.e.

1) When I submit jobs, they just remain queued

[root@<server> maui]# qstat -an1

<server>:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.<server>     aforti   long pbs-vm3.sh          --    --   --    --    --  Q   --     --
11.s<server>    aforti   long pbs-vm3.sh          --    --   --    --    --  Q   --     --

2) If I run qrun <jobid> the job runs, so I assume the problem is not 
between the torque server and the torque mom.
3) With the old versions, showq displayed the WCLimit of the default 
queue; now it displays 0 at first and then changes it by itself to 
100 days

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC REMAINING            STARTTIME


      0 Active Jobs       0 of   16 Processors Active (0.00%)
                          0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT            QUEUETIME

2                    aforti       Idle     1 99:23:59:59  Wed Oct 10 13:36:34
3                    aforti       Idle     1 99:23:59:59  Wed Oct 10 14:01:43
4                    aforti       Idle     1 99:23:59:59  Wed Oct 10 18:50:14
5                    aforti       Idle     1    00:00:00  Wed Oct 10 20:29:27

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT            QUEUETIME


Total Jobs: 4   Active Jobs: 0   Idle Jobs: 4   Blocked Jobs: 0


Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0

4) checkjob <jobid> just tells me the job cannot be run in the default 
partition, without giving any particular reason

[.....]
PE:  1.00  StartPriority:  120
cannot select job 10 for partition DEFAULT (Class)

5) checknode sees the node as free, in case that wasn't clear from the other commands

[root@<server> maui]# !checkno
checknode <node>

checking node <node>

State:      Idle  (in current state for 00:55:10)
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G DISK: 1M
Utilized   Resources: SWAP: 202M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]

Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)

Reservations:
NOTE:  no reservations on node

6) showbf -v, however, says my nodes are blocked by reservations, 
despite checknode clearly telling me there are no reservations on that 
node. In our local maui.cfg there is a reservation for 1 proc; I'm not 
sure why it blocks the whole node

[root@<server2> server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59

  3 procs available with no timelimit

node <node2> is blocked by reservation sft.0.0 in INFINITY

To be sure, I removed the reservation, but even after reducing 
maui.cfg to the default version without anything in it, showbf still 
tells me the node is blocked by "reservation NONE in INFINITY"

[root@<server> maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58

 16 procs available with no timelimit

node <node> is blocked by reservation NONE in INFINITY

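Going back to points 4 and 5: checkjob rejects the job with "(Class)", the 
job sits in queue long, and checknode lists only "Classes: [DEFAULT 1:1]" 
for the node. Assuming Maui treats each Torque queue as a class (an 
assumption here, not something these logs confirm), one possible check is 
whether the node advertises the job's queue among its classes. A minimal 
Python sketch of that check follows; node_classes is a hypothetical helper 
name, and the sample text is trimmed from the checknode output above.

```python
import re


def node_classes(checknode_output):
    """Return the class names found on the 'Classes:' line of checknode output.

    Hypothetical helper: it only parses text shaped like the output quoted
    in this message, e.g. 'Classes:    [DEFAULT 1:1]'.
    """
    for line in checknode_output.splitlines():
        line = line.strip()
        if line.startswith("Classes:"):
            # Each bracketed entry looks like "NAME avail:total"; keep NAME.
            entries = re.findall(r"\[([^\]]+)\]", line)
            return [entry.split()[0] for entry in entries if entry.split()]
    return []


# Sample trimmed from the checknode output quoted above.
sample = """State:      Idle  (in current state for 00:55:10)
Attributes: [Batch]
Classes:    [DEFAULT 1:1]
"""

queue = "long"  # queue the idle jobs were submitted to
classes = node_classes(sample)
print("node classes:", classes)
print("queue listed as a class on the node:", queue in classes)
```

If the queue is not listed, the mapping between the Torque queue and the 
node's classes may be a place to look next, though this sketch cannot 
confirm that it is the cause.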
If I increase the maui loglevel to 9 I get hundreds of these messages

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO:     no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO:     all clients connected. servicing requests

which leaves me perplexed, since elsewhere, at a different log level, 
maui sees the jobs waiting on the server, so somehow some 
communication happens and some doesn't

10/10 20:27:24 INFO:     job '2' Priority: 410
10/10 20:27:24 INFO:     Cred:      0(00.0) FS:      0(00.0)  Attr:      0(00.0)  Serv: 410(00.0)  Targ:      0(00.0)  Res:      0(00.0) Us:      0(00.0)
10/10 20:27:24 INFO:     job '2'  priority: 410.30
10/10 20:27:24 MJobGetStartPriority(3,0,Priority,NULL)
10/10 20:27:24 INFO:     job '3' Priority: 385
10/10 20:27:24 INFO:     Cred:      0(00.0) FS:      0(00.0)  Attr:      0(00.0)  Serv: 385(00.0)  Targ:      0(00.0)  Res:      0(00.0) Us:      0(00.0)
10/10 20:27:24 INFO:     job '3'  priority: 385.30
10/10 20:27:24 MJobGetStartPriority(4,0,Priority,NULL)
10/10 20:27:24 INFO:     job '4' Priority: 97
10/10 20:27:24 INFO:     Cred:      0(00.0) FS:      0(00.0)  Attr:      0(00.0)  Serv: 97(00.0)  Targ:      0(00.0)  Res:      0(00.0) Us:      0(00.0)
10/10 20:27:24 INFO:     job '4'  priority: 97.17

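A note on the "accept call failed, errno: 11" lines quoted earlier: on 
Linux, errno 11 is EAGAIN, which is what accept() reports on a 
non-blocking listening socket when no client is waiting to connect. If 
Maui polls its client socket that way (an assumption, not verified 
against the Maui source), those messages may simply mean that no client 
command arrived during that poll. The Python sketch below reproduces the 
same condition on a throwaway local socket; accept_without_client is a 
hypothetical name and nothing in it talks to Maui or Torque.

```python
import errno
import socket


def accept_without_client():
    """Call accept() on a non-blocking listening socket that nobody connects to.

    Returns the errno reported by the failed accept(), or None if a
    connection unexpectedly arrived.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.listen(1)
        listener.setblocking(False)
        try:
            conn, _addr = listener.accept()
        except BlockingIOError as exc:
            # No pending connection: the OS reports EAGAIN/EWOULDBLOCK.
            return exc.errno
        conn.close()
        return None
    finally:
        listener.close()


code = accept_without_client()
print("accept() errno:", code, errno.errorcode.get(code))
```

The point is only that this errno by itself need not indicate a broken 
connection between Maui and the PBS server.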
Thanks for any help. Here are the rpms I used

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5

the maui.cfg

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST              <server>
ADMIN1                  root
ADMINHOST               <server>
RMTYPE[0]           PBS
RMHOST[0]           <server>
RMSERVER[0]         <server>

SERVERPORT            40559
SERVERMODE            NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL        00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

and Torque config

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts += <server>
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = <server>
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)



