[Mauiusers] Maui problem

Alessandra Forti Alessandra.Forti at cern.ch
Wed Oct 10 06:57:40 MDT 2012


Further information:

If I increase the maui loglevel to 9 I get hundreds of these messages:

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO:     no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO:     all clients connected. servicing requests

If I reduce it to 8 I get, among other things:

10/10 13:55:11 MJobCheckLimits(2,HARD,P,8,Message)
10/10 13:55:11 INFO:     job 2 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/10 13:55:11 INFO:     total jobs selected in partition DEFAULT: 0/1 [Class: 1]

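
A small Python sketch to make that rejection line easier to read. It pulls the [name a:b] entries out of the rejection message and out of the Classes line of the checknode output quoted below. Reading each pair in the rejection message as requested:available is an assumption on my part, not something the log states, and parse_class_list is just a helper name made up for this example.

```python
import re

# Matches one "[name a:b]" entry as printed in the lines quoted in this thread.
ENTRY = re.compile(r"\[(\w+)\s+(\d+):(\d+)\]")


def parse_class_list(text):
    """Return {class_name: (first, second)} for every [name a:b] entry in text."""
    return {name: (int(a), int(b)) for name, a, b in ENTRY.findall(text)}


# Rejection line from the loglevel 8 output (rejoined from the wrapped mail text).
reject = ("job 2 rejected, partition DEFAULT "
          "(classes not supported '[long 1:0][DEFAULT 0:1]')")
# Classes line from the checknode output in the original message.
node = "Classes:    [DEFAULT 1:1]"

wanted = parse_class_list(reject)
offered = parse_class_list(node)

# ASSUMPTION: in the rejection line each pair means requested:available.
missing = [name for name, (req, avail) in wanted.items() if req > avail]

print("classes the job needs but the partition lacks:", missing)
print("classes the node reports:", sorted(offered))
```

If that reading is right, the job asks for class long while the node only reports class DEFAULT, which would be consistent with the "(Class)" reason checkjob gives.
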
cheers
alessandra


On 09/10/2012 18:04, Alessandra Forti wrote:
> Hi,
>
> I have installed a mini test cluster with torque and maui. We have 
> used maui/torque for years on our grid cluster and now we are 
> upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately with this new 
> combination maui doesn't seem to work correctly. When I submit jobs
> it behaves as if there weren't any free resources. Even when I
> tried to install only torque and maui with a bare minimum 
> configuration I got the same behaviour, i.e.
>
> 1) When I submit jobs they just remain queued
>
> [root@<server> maui]# qstat -an1
>
> <server>:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 10.<server>          aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --     --
> 11.s<server>         aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --     --
>
> 2) If I run qrun <jobid> the job runs so I assume the problem is not 
> between torque server and torque mom.
> 3) On the old versions showq displayed the WCLimit of the default
> queue; now it displays 0 at first and then changes by itself to
> 100 days
>
> [root@<server> maui]# showq
> ACTIVE JOBS--------------------
> JOBNAME            USERNAME      STATE  PROC REMAINING            STARTTIME
>
>
>      0 Active Jobs       0 of   16 Processors Active (0.00%)
>                          0 of    1 Nodes Active (0.00%)
>
> IDLE JOBS----------------------
> JOBNAME            USERNAME      STATE  PROC WCLIMIT            QUEUETIME
>
> 10                   aforti       Idle     1 99:23:59:59 Tue Oct  9 15:32:13
> 11                   aforti       Idle     1 99:23:59:59 Tue Oct  9 16:39:09
>
> 2 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME            USERNAME      STATE  PROC WCLIMIT            QUEUETIME
>
>
> Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0
>
> 4) checkjob <jobid> just tells me the job cannot be run in the default
> partition, without giving any particular reason
>
> [.....]
> PE:  1.00  StartPriority:  120
> cannot select job 10 for partition DEFAULT (Class)
>
> 5) checknode sees the node as free, in case that wasn't clear from the other commands
>
> [root@<server> maui]# !checkno
> checknode <node>
>
> checking node <node>
>
> State:      Idle  (in current state for 00:55:10)
> Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G DISK: 1M
> Utilized   Resources: SWAP: 202M
> Dedicated  Resources: [NONE]
> Opsys:         linux  Arch:      [NONE]
> Speed:      1.00  Load:       0.000
> Network:    [DEFAULT]
> Features:   [lcgpro]
> Attributes: [Batch]
> Classes:    [DEFAULT 1:1]
>
> Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)
>
> Reservations:
> NOTE:  no reservations on node
>
> 6) When I use showbf -v, though, it says my nodes are blocked by
> reservations, despite checknode clearly telling me there are no
> reservations on that node. In our local maui.cfg there is a
> reservation for 1 proc; I'm not sure why it blocks the whole node
>
> [root@<server2> server_logs]# showbf -v
> backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59
>
>   3 procs available with no timelimit
>
> node <node2> is blocked by reservation sft.0.0 in INFINITY
>
> But to be sure I removed it, and even with the reservation removed and
> maui.cfg reduced to the default version without anything in it, it
> tells me the node is blocked by "reservation NONE in INFINITY"
>
> [root@<server> maui]# showbf -v
> backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58
>
>  16 procs available with no timelimit
>
> node <node> is blocked by reservation NONE in INFINITY
>
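
On point 3: a quick check of what the WCLIMIT shown by showq amounts to. This is a minimal sketch that assumes the fields are days:hours:minutes:seconds, with leading fields optional; wclimit_to_seconds is a hypothetical helper written for this mail, not a Maui or Torque tool.

```python
def wclimit_to_seconds(wclimit):
    """Convert a limit such as '99:23:59:59' or '72:00:00' to seconds.

    ASSUMPTION: fields are read right to left as seconds, minutes, hours, days.
    """
    units = (1, 60, 3600, 86400)  # seconds per second, minute, hour, day
    fields = [int(part) for part in wclimit.split(":")]
    if len(fields) > len(units):
        raise ValueError("too many fields: " + wclimit)
    return sum(value * unit for value, unit in zip(reversed(fields), units))


seconds = wclimit_to_seconds("99:23:59:59")    # WCLIMIT from the showq output
queue_max = wclimit_to_seconds("72:00:00")     # resources_max.walltime of queue long

print(seconds)                 # 8639999
print(100 * 86400 - seconds)   # 1
print(queue_max)               # 259200
```

So the limit showq reports is one second short of 100 days, far above the 72:00:00 resources_max.walltime set on the long queue in the Torque config below.
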
> I'm not sure how to proceed because the log files don't tell me 
> anything and all the references I have found to a similar problem have 
> remained unanswered.
>
> Thanks for any help. Here are the rpms I used:
>
> maui-3.3-4.el5
> maui-client-3.3-4.el5
> maui-server-3.3-4.el5
> torque-2.5.7-7.el5
> torque-client-2.5.7-7.el5
> torque-server-2.5.7-7.el5
> libtorque-2.5.7-7.el5
>
> the maui.cfg
>
> #
> # MAUI configuration example
> # @(#)maui.cfg David Groep 20031015.1
> # for MAUI version 3.2.5
> #
> SERVERHOST              <server>
> ADMIN1                  root
> ADMINHOST               <server>
> RMTYPE[0]           PBS
> RMHOST[0]           <server>
> RMSERVER[0]         <server>
>
> SERVERPORT            40559
> SERVERMODE            NORMAL
>
> # Set PBS server polling interval. Since we have many short jobs
> # and want fast turn-around, set this to 10 seconds (default: 2 minutes)
> RMPOLLINTERVAL        00:00:10
>
> # a max. 10 MByte log file in a logical location
> LOGFILE               /var/log/maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
>
> and Torque config
>
> create queue long
> set queue long queue_type = Execution
> set queue long acl_hosts = localhost
> set queue long acl_hosts += <server>
> set queue long resources_max.cput = 48:00:00
> set queue long resources_max.walltime = 72:00:00
> set queue long acl_group_enable = True
> set queue long acl_groups = aforti
> set queue long enabled = True
> set queue long started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_host_enable = False
> set server acl_hosts = <server>
> set server acl_hosts += localhost
> set server default_queue = long
> set server log_events = 511
> set server mail_from = adm
> set server next_job_number = 12
> -- 
> Facts aren't facts if they come from the wrong people. (Paul Krugman)


-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)
