[Mauiusers] Maui problem

Alessandra Forti Alessandra.Forti at cern.ch
Tue Oct 9 11:04:26 MDT 2012


I have installed a mini test cluster with torque and maui. We have used 
maui/torque for years on our grid cluster and are now upgrading to 
torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination 
maui doesn't seem to work correctly. When I submit jobs, it behaves 
as if there weren't any free resources. Even when I installed 
only torque and maui with a bare minimum configuration I got the same 
behaviour, i.e.

1) When I submit jobs, they just remain queued:

[root@<server> maui]# qstat -an1
<server>:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.<server>          aforti   long     pbs-vm3.sh       --     --    --  --     --    Q --    --
11.s<server>         aforti   long     pbs-vm3.sh       --     --    --  --     --    Q --    --

2) If I run qrun <jobid>, the job runs, so I assume the problem is not 
between the torque server and the torque mom.
3) On the old versions, showq displayed the WCLimit of the default 
queue. Now it displays 0 at first and then changes by itself to 
100 days:

[root@<server> maui]# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC REMAINING
     0 Active Jobs       0 of   16 Processors Active (0.00%)
                         0 of    1 Nodes Active (0.00%)
IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT
10                   aforti       Idle     1 99:23:59:59 Tue Oct  9
11                   aforti       Idle     1 99:23:59:59 Tue Oct  9
2 Idle Jobs
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC WCLIMIT
Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0

4) checkjob <jobid> just tells me the job cannot be run in the default 
partition, without giving any particular reason:

PE:  1.00  StartPriority:  120
cannot select job 10 for partition DEFAULT (Class)

5) checknode sees the node as free, in case that wasn't clear from the other commands:

[root@<server> maui]# !checkno
checknode <node>
checking node <node>
State:      Idle  (in current state for 00:55:10)
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G  DISK: 1M
Utilized   Resources: SWAP: 202M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]
Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)
NOTE:  no reservations on node
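
One cross-check that might be worth making (a guess, not a confirmed diagnosis): checkjob prints "(Class)" next to its message, checknode lists only the class DEFAULT, and the jobs sit in the queue "long". The Python sketch below only parses the two outputs pasted above and compares them. The function names are hypothetical, and the idea that Maui matches the queue name against the node's Classes line is an assumption.

```python
import re


def node_classes(checknode_output):
    """Return the class names found on the 'Classes:' line of checknode output."""
    match = re.search(r"^Classes:\s*(.*)$", checknode_output, re.MULTILINE)
    if match is None:
        return []
    names = []
    # Entries look like "[DEFAULT 1:1]"; keep the first word inside each bracket.
    for entry in re.findall(r"\[([^\]]+)\]", match.group(1)):
        words = entry.split()
        if words:
            names.append(words[0])
    return names


def blocked_reason(checkjob_output):
    """Return the word in parentheses after 'cannot select job ...', or None."""
    match = re.search(
        r"cannot select job \S+ for partition \S+ \((\w+)\)", checkjob_output
    )
    return match.group(1) if match else None


# Samples copied from the outputs in this message.
checknode_sample = """State:      Idle  (in current state for 00:55:10)
Attributes: [Batch]
Classes:    [DEFAULT 1:1]
"""
checkjob_sample = "cannot select job 10 for partition DEFAULT (Class)"

queue = "long"  # queue shown by qstat for jobs 10 and 11
classes = node_classes(checknode_sample)
reason = blocked_reason(checkjob_sample)

print("checkjob reason:", reason)
print("node classes:", classes)
print("queue listed on node:", queue in classes)
```

With the samples above, the queue name "long" does not appear among the node's classes, which may or may not be related to the "(Class)" message.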

6) When I use showbf -v, though, it says my nodes are blocked by 
reservations, despite checknode clearly telling me there are no 
reservations on that node. In our local maui.cfg there is a reservation 
for 1 proc; I'm not sure why it blocks the whole node.

[root@<server2> server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9
  3 procs available with no timelimit
node <node2> is blocked by reservation sft.0.0 in INFINITY

To be sure, I removed the reservation. Even with it removed and 
maui.cfg reduced to the default version with nothing in it, showbf 
tells me the node is blocked by "reservation NONE in INFINITY":

[root@<server> maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9
 16 procs available with no timelimit
node <node> is blocked by reservation NONE in INFINITY

I'm not sure how to proceed, because the log files don't tell me anything 
and all the references I have found to a similar problem have remained 
unanswered.
Thanks for any help. Here are the rpms I used

the maui.cfg:

# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
SERVERHOST              <server>
ADMIN1                  root
ADMINHOST               <server>
RMTYPE[0]           PBS
RMHOST[0]           <server>
RMSERVER[0]         <server>
SERVERPORT            40559

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL        00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

and the Torque config:

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts += <server>
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
# Set server attributes.
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = <server>
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

Facts aren't facts if they come from the wrong people. (Paul Krugman)
