[torqueusers] 2 Jobs per node limit in Maui?

André Gemünd andre.gemuend at scai.fraunhofer.de
Fri Mar 28 10:15:30 MDT 2014


Hi Nico,

Could you please unset resources_default.nodect and resources_default.neednodes on queue batch and check again? neednodes is for node attributes specified in the nodes file. I guess you don't have an attribute named "1" and just wanted to specify the default number of nodes. nodect is not required if you specify nodes.
It's interesting that the scheduler actually considers your job eligible to run. You say it stays in the queue like that? Then the problem must be in Torque, not in the scheduler.
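
As a rough illustration of the check described above, here is a small Python sketch that scans pasted qmgr output for the two defaults in question and prints the matching unset commands. The function name unset_suggestions and the sample text are made up for this example, and the rule "a purely numeric neednodes value is probably a mistake" is only the guess stated above, not something Torque enforces.

```python
import re

# Matches lines such as: set queue batch resources_default.nodes = 1
# (search() is used so that mail-quoted lines starting with "> " also match).
_DEFAULT = re.compile(
    r"set queue (\S+) resources_default\.(\w+)\s*=\s*(\S+)")


def unset_suggestions(qmgr_text, queue="batch"):
    """Return qmgr directives that remove the questionable queue defaults.

    Hypothetical helper: it only inspects text, it never calls qmgr.
    """
    defaults = {}
    for line in qmgr_text.splitlines():
        m = _DEFAULT.search(line)
        if m and m.group(1) == queue:
            defaults[m.group(2)] = m.group(3)

    to_unset = []
    # neednodes is meant to hold a node attribute name; a bare number
    # most likely was intended as a node count (assumption).
    if defaults.get("neednodes", "").isdigit():
        to_unset.append("neednodes")
    # nodect is not needed when a nodes default is already given.
    if "nodect" in defaults and "nodes" in defaults:
        to_unset.append("nodect")

    return ["unset queue %s resources_default.%s" % (queue, attr)
            for attr in to_unset]


# Sample taken from the queue configuration quoted in this thread.
sample = """\
set queue batch resources_default.neednodes = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 9999:00:00
"""

for directive in unset_suggestions(sample):
    print('qmgr -c "%s"' % directive)
```

The printed lines can be reviewed and then run by hand as a Torque manager; the sketch deliberately does not change the server configuration itself.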

Does tracejob show any attempts to start the job?

Greetings
Andre

----- Original Message -----
> Hi Gus and Andre,
> 
> thanks for your suggestions. I've checked the various outputs without
> getting a real clue as to what is causing this behaviour. We don't
> use any special job scheduling or prioritization, so I left the
> configuration largely as it was installed. I've included the outputs
> below, so if anyone notices a smoking gun, please speak up.
> 
> However, we found one interesting thing: setting the walltime in qsub
> motivates Maui to start another job from the same user on the node,
> while jobs from other users may start as well (they don't always).
> 
> Again, any help or suggestions are greatly appreciated.
> 
> Best regards,
> 
>      Nico
> 
> 
> 
> The checkjob output of the first waiting job:
> 
> ============
> checking job 37001
> 
> State: Idle
> Creds:  user:*****  group:*****  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 5:00:00:00
> SubmitTime: Fri Mar 28 13:46:07
>   (Time Queued  Total: 2:27:35  Eligible: 2:26:05)
> 
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [ccdn205]     <<< name of the node where the job is to be run
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> PE:  1.00  StartPriority:  146
> job can run in partition DEFAULT (5 procs available.  1 procs required)
> =============
> 
> The server configuration:
> 
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.neednodes = 1
> set queue batch resources_default.nodect = 1
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 9999:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server keep_completed = 60
> set server next_job_number = 37030
> set server moab_array_compatible = True
> 
> The Maui showconfig output:
> 
> # Maui version 3.3.1 (PID: 1980)
> # global policies
> 
> REJECTNEGPRIOJOBS[0]              FALSE
> ENABLENEGJOBPRIORITY[0]           FALSE
> ENABLEMULTINODEJOBS[0]            TRUE
> ENABLEMULTIREQJOBS[0]             FALSE
> BFPRIORITYPOLICY[0]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> USEMACHINESPEEDFORFS            FALSE
> USEMACHINESPEED                 FALSE
> USESYSTEMQUEUETIME              TRUE
> USELOCALMACHINEPRIORITY         FALSE
> NODEUNTRACKEDLOADFACTOR         1.2
> JOBNODEMATCHPOLICY[0]
> JOBMAXSTARTTIME[0]                  INFINITY
> METAMAXTASKS[0]                   0
> NODESETPOLICY[0]                  [NONE]
> NODESETATTRIBUTE[0]               [NONE]
> NODESETLIST[0]
> NODESETDELAY[0]                   00:00:00
> NODESETPRIORITYTYPE[0]            MINLOSS
> NODESETTOLERANCE[0]                 0.00
> BACKFILLPOLICY[0]                 FIRSTFIT
> BACKFILLDEPTH[0]                  0
> BACKFILLPROCFACTOR[0]             0
> BACKFILLMAXSCHEDULES[0]           10000
> BACKFILLMETRIC[0]                 PROCS
> BFCHUNKDURATION[0]                00:00:00
> BFCHUNKSIZE[0]                    0
> PREEMPTPOLICY[0]                  REQUEUE
> MINADMINSTIME[0]                  00:00:00
> RESOURCELIMITPOLICY[0]
> NODEAVAILABILITYPOLICY[0]         COMBINED:[DEFAULT]
> NODEALLOCATIONPOLICY[0]           MINRESOURCE
> TASKDISTRIBUTIONPOLICY[0]         DEFAULT
> RESERVATIONPOLICY[0]              CURRENTHIGHEST
> RESERVATIONRETRYTIME[0]           00:00:00
> RESERVATIONTHRESHOLDTYPE[0]       NONE
> RESERVATIONTHRESHOLDVALUE[0]      0
> FSPOLICY                        [NONE]
> FSPOLICY                        [NONE]
> FSINTERVAL                      12:00:00
> FSDEPTH                         8
> FSDECAY                         1.00
> 
> # Priority Weights
> SERVICEWEIGHT[0]                  1
> TARGETWEIGHT[0]                   1
> CREDWEIGHT[0]                     1
> ATTRWEIGHT[0]                     1
> FSWEIGHT[0]                       1
> RESWEIGHT[0]                      1
> USAGEWEIGHT[0]                    1
> QUEUETIMEWEIGHT[0]                1
> XFACTORWEIGHT[0]                  0
> SPVIOLATIONWEIGHT[0]              0
> BYPASSWEIGHT[0]                   0
> TARGETQUEUETIMEWEIGHT[0]          0
> TARGETXFACTORWEIGHT[0]            0
> USERWEIGHT[0]                     0
> GROUPWEIGHT[0]                    0
> ACCOUNTWEIGHT[0]                  0
> QOSWEIGHT[0]                      0
> CLASSWEIGHT[0]                    0
> FSUSERWEIGHT[0]                   0
> FSGROUPWEIGHT[0]                  0
> FSACCOUNTWEIGHT[0]                0
> FSQOSWEIGHT[0]                    0
> FSCLASSWEIGHT[0]                  0
> ATTRATTRWEIGHT[0]                 0
> ATTRSTATEWEIGHT[0]                0
> NODEWEIGHT[0]                     0
> PROCWEIGHT[0]                     0
> MEMWEIGHT[0]                      0
> SWAPWEIGHT[0]                     0
> DISKWEIGHT[0]                     0
> PSWEIGHT[0]                       0
> PEWEIGHT[0]                       0
> WALLTIMEWEIGHT[0]                 0
> UPROCWEIGHT[0]                    0
> UJOBWEIGHT[0]                     0
> CONSUMEDWEIGHT[0]                 0
> USAGEEXECUTIONTIMEWEIGHT[0]       0
> REMAININGWEIGHT[0]                0
> PERCENTWEIGHT[0]                  0
> XFMINWCLIMIT[0]                   00:02:00
> 
> # partition DEFAULT policies
> REJECTNEGPRIOJOBS[1]              FALSE
> ENABLENEGJOBPRIORITY[1]           FALSE
> ENABLEMULTINODEJOBS[1]            TRUE
> ENABLEMULTIREQJOBS[1]             FALSE
> BFPRIORITYPOLICY[1]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> JOBNODEMATCHPOLICY[1]
> JOBMAXSTARTTIME[1]                  INFINITY
> METAMAXTASKS[1]                   0
> NODESETPOLICY[1]                  [NONE]
> NODESETATTRIBUTE[1]               [NONE]
> NODESETLIST[1]
> NODESETDELAY[1]                   00:00:00
> NODESETPRIORITYTYPE[1]            MINLOSS
> NODESETTOLERANCE[1]                 0.00
> 
> # Priority Weights
> XFMINWCLIMIT[1]                   00:00:00
> RMAUTHTYPE[0]                     CHECKSUM
> CLASSCFG[[NONE]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[[ALL]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[batch]  DEFAULT.FEATURES=[NONE]
> QOSPRIORITY[0]                    0
> QOSQTWEIGHT[0]                    0
> QOSXFWEIGHT[0]                    0
> QOSTARGETXF[0]                      0.00
> QOSTARGETQT[0]                    00:00:00
> QOSFLAGS[0]
> QOSPRIORITY[1]                    0
> QOSQTWEIGHT[1]                    0
> QOSXFWEIGHT[1]                    0
> QOSTARGETXF[1]                      0.00
> QOSTARGETQT[1]                    00:00:00
> QOSFLAGS[1]
> 
> # SERVER MODULES:  MX
> SERVERMODE                      NORMAL
> SERVERNAME
> SERVERHOST                      *****
> SERVERPORT                      42559
> LOGFILE                         maui.log
> LOGFILEMAXSIZE                  10000000
> LOGFILEROLLDEPTH                1
> LOGLEVEL                        3
> LOGFACILITY                     fALL
> SERVERHOMEDIR                   /usr/local/maui/
> TOOLSDIR                        /usr/local/maui/tools/
> LOGDIR                          /usr/local/maui/log/
> STATDIR                         /usr/local/maui/stats/
> LOCKFILE                        /usr/local/maui/maui.pid
> SERVERCONFIGFILE                /usr/local/maui/maui.cfg
> CHECKPOINTFILE                  /usr/local/maui/maui.ck
> CHECKPOINTINTERVAL              00:05:00
> CHECKPOINTEXPIRATIONTIME        3:11:20:00
> TRAPJOB
> TRAPNODE
> TRAPFUNCTION
> RESDEPTH                        24
> RMPOLLINTERVAL                  00:00:30
> NODEACCESSPOLICY                SHARED
> ALLOCLOCALITYPOLICY             [NONE]
> SIMTIMEPOLICY                   [NONE]
> ADMIN1                          root
> ADMINHOSTS                      ALL
> NODEPOLLFREQUENCY               0
> DISPLAYFLAGS
> DEFAULTDOMAIN
> DEFAULTCLASSLIST                [DEFAULT:1]
> FEATURENODETYPEHEADER
> FEATUREPROCSPEEDHEADER
> FEATUREPARTITIONHEADER
> DEFERTIME                       1:00:00
> DEFERCOUNT                      24
> DEFERSTARTCOUNT                 1
> JOBPURGETIME                    0
> NODEPURGETIME                   2140000000
> APIFAILURETHRESHHOLD            6
> NODESYNCTIME                    600
> JOBSYNCTIME                     600
> JOBMAXOVERRUN                   00:10:00
> NODEMAXLOAD                     0.0
> PLOTMINTIME                     120
> PLOTMAXTIME                     245760
> PLOTTIMESCALE                   11
> PLOTMINPROC                     1
> PLOTMAXPROC                     512
> PLOTPROCSCALE                   9
> SCHEDCFG[]                        MODE=NORMAL SERVER=*****:42559
> 
> # RM MODULES: PBS SSS WIKI NATIVE
> RMCFG[*****] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS
> SIMWORKLOADTRACEFILE            workload
> SIMRESOURCETRACEFILE            resource
> SIMAUTOSHUTDOWN                 OFF
> SIMSTARTTIME                    0
> SIMSCALEJOBRUNTIME              FALSE
> SIMFLAGS
> SIMJOBSUBMISSIONPOLICY          CONSTANTJOBDEPTH
> SIMINITIALQUEUEDEPTH            16
> SIMWCACCURACY                   0.00
> SIMWCACCURACYCHANGE             0.00
> SIMNODECOUNT                    0
> SIMNODECONFIGURATION            NORMAL
> SIMWCSCALINGPERCENT             100
> SIMCOMRATE                      0.10
> SIMCOMTYPE                      ROUNDROBIN
> COMINTRAFRAMECOST               0.30
> COMINTERFRAMECOST               0.30
> SIMSTOPITERATION                -1
> SIMEXITITERATION                -1
> 
> 
> On 27 Mar. 2014, at 20:23, André Gemünd
> <andre.gemuend at scai.fraunhofer.de> wrote:
> 
> > Hi Nicola,
> > 
> > Could you post a checkjob output of a job that is queued? It should
> > have a reason for the job state, e.g. deferred, no resources
> > available or something like that. There can be other resources
> > than cores that inhibit execution, like memory, ncpus, sharing
> > restrictions, etc.
> > 
> > Of course, the qmgr 'p s' (print server) and showconfig output, as
> > Gus said, would be most helpful, just in case.
> > 
> > Greetings
> > Andre
> > 
> > ----- Original Message -----
> >> Hello!
> >> 
> >> At our site, we're currently using Torque 4.2.5 with Maui 3.3.1,
> >> the latter with the default configuration. This setup works fine,
> >> but we've noticed that Maui appears to limit the number of jobs
> >> per node to 2: when a user submits 8 jobs that request ppn=1 to a
> >> particular np=8 node, only 2 jobs are started by the scheduler.
> >> The remaining 6 jobs must be started via qrun.
> >> 
> >> I could not find anything that might be related to this behaviour
> >> in the documentation (links in maui.cfg are broken btw); the
> >> showconfig command doesn't list any parameter with a value of 2,
> >> so that didn't help either.
> >> 
> >> Any help/suggestions how to make Maui start more jobs per node are
> >> greatly appreciated.
> >> 
> >> Best regards,
> >> 
> >>    Nico van Eikema Hommes
> 
> --
> Dr. N.J.R. van Eikema Hommes   Computer-Chemie-Centrum
> E-Mail:   nico.hommes at fau.de   Universitaet Erlangen-Nuernberg
> Phone:      +49-9131-8520402   Naegelsbachstr. 25
> FAX:        +49-9131-8520404   91052 Erlangen, Germany

-- 
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

