[torqueusers] 2 Jobs per node limit in Maui?

Gus Correa gus at ldeo.columbia.edu
Fri Mar 28 10:50:08 MDT 2014


Hi Nico

Thanks for sending the additional information.
I can't see any smoking gun either; maybe Andre can spot something.
Sometimes it is a needle in a haystack.

Some vague possibilities:

1) You mentioned the curious walltime behavior.
I have never used a default walltime that large.

 > set queue batch resources_default.walltime = 9999:00:00

Our production queues normally have a 12-hour maximum (with default=max).
The largest I have used so far was 288:00:00, for a limited time, to
accommodate one long serial job from a specific user.
It worked.

However, I wonder if Torque may have some hardwired limit, or get
confused when parsing the 9999 hours, although this is just a wild guess 
(a Torque developer could comment on this).
You could test by reducing that number and restarting the pbs_server.
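
For example, something along these lines (240:00:00 is just an arbitrary
smaller test value; use your own init script or service manager to restart
the daemon if you prefer):

  qmgr -c "set queue batch resources_default.walltime = 240:00:00"
  qterm -t quick      # stop pbs_server; running jobs keep running
  pbs_server          # start pbs_server again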

2) Until recently I had significant problems with jobs hanging in the Q state.
You may not have the same problem, but I wonder whether you are using NFS
over RDMA to share users' home directories, scratch space, or other filesystems.

The problem here was not in Torque itself, but it wreaked havoc on Torque.
Specifically, NFSv4 over RDMA would break early in a job's life, while
pbs_mom was spawning copies of itself owned by the user.
Those copies would never go away (because of the NFS problem),
and would break the communication between that pbs_mom and the pbs_server,
resulting in jobs stuck in the Q state.

It happened on some nodes but not all of them, with some jobs but not all
of them; it was random, not reproducible, and hard to nail down.
It took a while to sort that one out.
If you dig through the list archives you will find my postings about it.

Symptoms of this were (a quick way to check them across nodes is sketched below):
A. Copies of pbs_mom lingering on the nodes (ps aux | grep pbs_mom on the
nodes will tell).
B. Jobs in Q state with nodes available.
C. Maui logs and Torque server logs helped, as they suggested a
communication problem with the affected nodes.
D. /var/log/messages and dmesg showed the NFS errors (as I eventually found).
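
If you want a quick look for those symptoms across all nodes at once,
something like this works (assuming pdsh is available and your nodes are
named node01..node32; adjust to your naming):

  pdsh -w node[01-32] 'ps -C pbs_mom -o pid,user,etime,args --no-headers'
  pdsh -w node[01-32] 'dmesg | grep -i nfs | tail -3'

User-owned pbs_mom copies that outlive their jobs are the suspicious ones.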

The fix I applied was to move from NFS over RDMA
back to NFS over IPoIB (TCP/IP over IB).
It has been working well ever since.

However, if you use NFS over TCP/IP (Ethernet or IB) I think that won't
be a problem for you, although it is always worth keeping an eye on NFS.
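
A quick way to confirm which transport your NFS mounts actually use is to
check the mount options (look for proto=rdma versus proto=tcp):

  grep ' nfs' /proc/mounts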

3) Some tedious things to check (a few example commands follow the list):

- Are you running trqauthd on every node, besides pbs_mom?
- Are you running trqauthd on the head node (where pbs_server runs)?
- Did you check the Maui log to see how these Q jobs are handled?
- Did you check the Torque server logs to see if there is any clue
about communication with the pbs_mom on the nodes where these serial
jobs are running?
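
For example (adjust the job id, node name, date, and paths to your setup;
the Maui log path is taken from your showconfig output, and
/var/spool/torque is only the usual default for TORQUE_HOME):

  pgrep -l trqauthd       # on the head node and on every compute node
  pgrep -l pbs_mom        # on the compute nodes
  grep 37001 /usr/local/maui/log/maui.log
  grep -i ccdn205 /var/spool/torque/server_logs/20140328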

I hope this helps,
Gus Correa

PS - I just saw Andre's message.
He spotted an important glitch in resources_default.neednodes.
Indeed, that attribute is what we use to tag nodes with "properties"
and match them to different queues.
E.g.:
In the queue's configuration (qmgr):

set queue production resources_default.neednodes = prod
...
set queue development resources_default.neednodes = dev

and in the nodes file:

node01 np=8 prod
...
node32 np=8 dev
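
If you do not actually rely on node properties to map queues onto nodes, my
guess (just a guess; Andre or a Torque developer may correct me) is that
simply removing that default would avoid the mismatch he pointed out:

  qmgr -c "unset queue batch resources_default.neednodes"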

Just to clarify: you are using Maui, not Moab, right?

On 03/28/2014 11:43 AM, Dr. Nico van Eikema Hommes wrote:
> Hi Gus and Andre,
>
> thanks for your suggestions.
> I've checked the various outputs without getting a real clue as to what
> is causing this behaviour. We don't use any special job scheduling or
> prioritisation, so I left the configuration largely as it was installed.
> I've included the outputs below, so if anyone notices a smoking gun,
> please speak up.
>
> However, we found one interesting thing: setting the walltime in
> qsub motivates Maui to start another job from the same user on the
> node, while jobs from other users may start as well (they don't always).
>
> Again, any help or suggestions are greatly appreciated.
>
> Best regards,
>
>       Nico
>
>
>
> The checkjob output of the first waiting job:
>
> ============
> checking job 37001
>
> State: Idle
> Creds:  user:*****  group:*****  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 5:00:00:00
> SubmitTime: Fri Mar 28 13:46:07
>    (Time Queued  Total: 2:27:35  Eligible: 2:26:05)
>
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [ccdn205]     <<< name of the node where the job is to be run
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> PE:  1.00  StartPriority:  146
> job can run in partition DEFAULT (5 procs available.  1 procs required)
> =============
>
> The server configuration:
>
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.neednodes = 1
> set queue batch resources_default.nodect = 1
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 9999:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 60
> set server node_check_rate = 150
> set server tcp_timeout = 300
> set server job_stat_rate = 45
> set server poll_jobs = True
> set server mom_job_sync = True
> set server keep_completed = 60
> set server next_job_number = 37030
> set server moab_array_compatible = True
>
> The Maui showconfig output:
>
> # Maui version 3.3.1 (PID: 1980)
> # global policies
>
> REJECTNEGPRIOJOBS[0]              FALSE
> ENABLENEGJOBPRIORITY[0]           FALSE
> ENABLEMULTINODEJOBS[0]            TRUE
> ENABLEMULTIREQJOBS[0]             FALSE
> BFPRIORITYPOLICY[0]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> USEMACHINESPEEDFORFS            FALSE
> USEMACHINESPEED                 FALSE
> USESYSTEMQUEUETIME              TRUE
> USELOCALMACHINEPRIORITY         FALSE
> NODEUNTRACKEDLOADFACTOR         1.2
> JOBNODEMATCHPOLICY[0]
> JOBMAXSTARTTIME[0]                  INFINITY
> METAMAXTASKS[0]                   0
> NODESETPOLICY[0]                  [NONE]
> NODESETATTRIBUTE[0]               [NONE]
> NODESETLIST[0]
> NODESETDELAY[0]                   00:00:00
> NODESETPRIORITYTYPE[0]            MINLOSS
> NODESETTOLERANCE[0]                 0.00
> BACKFILLPOLICY[0]                 FIRSTFIT
> BACKFILLDEPTH[0]                  0
> BACKFILLPROCFACTOR[0]             0
> BACKFILLMAXSCHEDULES[0]           10000
> BACKFILLMETRIC[0]                 PROCS
> BFCHUNKDURATION[0]                00:00:00
> BFCHUNKSIZE[0]                    0
> PREEMPTPOLICY[0]                  REQUEUE
> MINADMINSTIME[0]                  00:00:00
> RESOURCELIMITPOLICY[0]
> NODEAVAILABILITYPOLICY[0]         COMBINED:[DEFAULT]
> NODEALLOCATIONPOLICY[0]           MINRESOURCE
> TASKDISTRIBUTIONPOLICY[0]         DEFAULT
> RESERVATIONPOLICY[0]              CURRENTHIGHEST
> RESERVATIONRETRYTIME[0]           00:00:00
> RESERVATIONTHRESHOLDTYPE[0]       NONE
> RESERVATIONTHRESHOLDVALUE[0]      0
> FSPOLICY                        [NONE]
> FSPOLICY                        [NONE]
> FSINTERVAL                      12:00:00
> FSDEPTH                         8
> FSDECAY                         1.00
>
> # Priority Weights
> SERVICEWEIGHT[0]                  1
> TARGETWEIGHT[0]                   1
> CREDWEIGHT[0]                     1
> ATTRWEIGHT[0]                     1
> FSWEIGHT[0]                       1
> RESWEIGHT[0]                      1
> USAGEWEIGHT[0]                    1
> QUEUETIMEWEIGHT[0]                1
> XFACTORWEIGHT[0]                  0
> SPVIOLATIONWEIGHT[0]              0
> BYPASSWEIGHT[0]                   0
> TARGETQUEUETIMEWEIGHT[0]          0
> TARGETXFACTORWEIGHT[0]            0
> USERWEIGHT[0]                     0
> GROUPWEIGHT[0]                    0
> ACCOUNTWEIGHT[0]                  0
> QOSWEIGHT[0]                      0
> CLASSWEIGHT[0]                    0
> FSUSERWEIGHT[0]                   0
> FSGROUPWEIGHT[0]                  0
> FSACCOUNTWEIGHT[0]                0
> FSQOSWEIGHT[0]                    0
> FSCLASSWEIGHT[0]                  0
> ATTRATTRWEIGHT[0]                 0
> ATTRSTATEWEIGHT[0]                0
> NODEWEIGHT[0]                     0
> PROCWEIGHT[0]                     0
> MEMWEIGHT[0]                      0
> SWAPWEIGHT[0]                     0
> DISKWEIGHT[0]                     0
> PSWEIGHT[0]                       0
> PEWEIGHT[0]                       0
> WALLTIMEWEIGHT[0]                 0
> UPROCWEIGHT[0]                    0
> UJOBWEIGHT[0]                     0
> CONSUMEDWEIGHT[0]                 0
> USAGEEXECUTIONTIMEWEIGHT[0]       0
> REMAININGWEIGHT[0]                0
> PERCENTWEIGHT[0]                  0
> XFMINWCLIMIT[0]                   00:02:00
>
> # partition DEFAULT policies
> REJECTNEGPRIOJOBS[1]              FALSE
> ENABLENEGJOBPRIORITY[1]           FALSE
> ENABLEMULTINODEJOBS[1]            TRUE
> ENABLEMULTIREQJOBS[1]             FALSE
> BFPRIORITYPOLICY[1]               [NONE]
> JOBPRIOACCRUALPOLICY            QUEUEPOLICY
> NODELOADPOLICY                  ADJUSTSTATE
> JOBNODEMATCHPOLICY[1]
> JOBMAXSTARTTIME[1]                  INFINITY
> METAMAXTASKS[1]                   0
> NODESETPOLICY[1]                  [NONE]
> NODESETATTRIBUTE[1]               [NONE]
> NODESETLIST[1]
> NODESETDELAY[1]                   00:00:00
> NODESETPRIORITYTYPE[1]            MINLOSS
> NODESETTOLERANCE[1]                 0.00
>
> # Priority Weights
> XFMINWCLIMIT[1]                   00:00:00
> RMAUTHTYPE[0]                     CHECKSUM
> CLASSCFG[[NONE]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[[ALL]]  DEFAULT.FEATURES=[NONE]
> CLASSCFG[batch]  DEFAULT.FEATURES=[NONE]
> QOSPRIORITY[0]                    0
> QOSQTWEIGHT[0]                    0
> QOSXFWEIGHT[0]                    0
> QOSTARGETXF[0]                      0.00
> QOSTARGETQT[0]                    00:00:00
> QOSFLAGS[0]
> QOSPRIORITY[1]                    0
> QOSQTWEIGHT[1]                    0
> QOSXFWEIGHT[1]                    0
> QOSTARGETXF[1]                      0.00
> QOSTARGETQT[1]                    00:00:00
> QOSFLAGS[1]
>
> # SERVER MODULES:  MX
> SERVERMODE                      NORMAL
> SERVERNAME
> SERVERHOST                      *****
> SERVERPORT                      42559
> LOGFILE                         maui.log
> LOGFILEMAXSIZE                  10000000
> LOGFILEROLLDEPTH                1
> LOGLEVEL                        3
> LOGFACILITY                     fALL
> SERVERHOMEDIR                   /usr/local/maui/
> TOOLSDIR                        /usr/local/maui/tools/
> LOGDIR                          /usr/local/maui/log/
> STATDIR                         /usr/local/maui/stats/
> LOCKFILE                        /usr/local/maui/maui.pid
> SERVERCONFIGFILE                /usr/local/maui/maui.cfg
> CHECKPOINTFILE                  /usr/local/maui/maui.ck
> CHECKPOINTINTERVAL              00:05:00
> CHECKPOINTEXPIRATIONTIME        3:11:20:00
> TRAPJOB
> TRAPNODE
> TRAPFUNCTION
> RESDEPTH                        24
> RMPOLLINTERVAL                  00:00:30
> NODEACCESSPOLICY                SHARED
> ALLOCLOCALITYPOLICY             [NONE]
> SIMTIMEPOLICY                   [NONE]
> ADMIN1                          root
> ADMINHOSTS                      ALL
> NODEPOLLFREQUENCY               0
> DISPLAYFLAGS
> DEFAULTDOMAIN
> DEFAULTCLASSLIST                [DEFAULT:1]
> FEATURENODETYPEHEADER
> FEATUREPROCSPEEDHEADER
> FEATUREPARTITIONHEADER
> DEFERTIME                       1:00:00
> DEFERCOUNT                      24
> DEFERSTARTCOUNT                 1
> JOBPURGETIME                    0
> NODEPURGETIME                   2140000000
> APIFAILURETHRESHHOLD            6
> NODESYNCTIME                    600
> JOBSYNCTIME                     600
> JOBMAXOVERRUN                   00:10:00
> NODEMAXLOAD                     0.0
> PLOTMINTIME                     120
> PLOTMAXTIME                     245760
> PLOTTIMESCALE                   11
> PLOTMINPROC                     1
> PLOTMAXPROC                     512
> PLOTPROCSCALE                   9
> SCHEDCFG[]                        MODE=NORMAL SERVER=*****:42559
>
> # RM MODULES: PBS SSS WIKI NATIVE
> RMCFG[*****] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS
> SIMWORKLOADTRACEFILE            workload
> SIMRESOURCETRACEFILE            resource
> SIMAUTOSHUTDOWN                 OFF
> SIMSTARTTIME                    0
> SIMSCALEJOBRUNTIME              FALSE
> SIMFLAGS
> SIMJOBSUBMISSIONPOLICY          CONSTANTJOBDEPTH
> SIMINITIALQUEUEDEPTH            16
> SIMWCACCURACY                   0.00
> SIMWCACCURACYCHANGE             0.00
> SIMNODECOUNT                    0
> SIMNODECONFIGURATION            NORMAL
> SIMWCSCALINGPERCENT             100
> SIMCOMRATE                      0.10
> SIMCOMTYPE                      ROUNDROBIN
> COMINTRAFRAMECOST               0.30
> COMINTERFRAMECOST               0.30
> SIMSTOPITERATION                -1
> SIMEXITITERATION                -1
>
>
> On 27 Mar. 2014, at 20:23, André Gemünd <andre.gemuend at scai.fraunhofer.de> wrote:
>
>> Hi Nicola,
>>
>> Could you post a checkjob output of a job that is queued? It should have a reason for the job state, e.g. deferred, no resources available or something like that. There can be other resources than cores that inhibit execution, like memory, ncpus, sharing restrictions, etc.
>>
>> Of course, the 'p s' and showconfig output as Gus said would be most helpful, just in case.
>>
>> Greetings
>> Andre
>>
>> ----- Original Message -----
>>> Hello!
>>>
>>> At our site, we're currently using Torque 4.2.5 with Maui 3.3.1, the
>>> latter with the default configuration. This setup works fine, but
>>> we've noticed that Maui appears to limit the number of jobs per node
>>> to 2: when a user submits 8 jobs that request ppn=1 to a particular
>>> np=8 node, only 2 jobs are started by the scheduler. The remaining 6
>>> jobs must be started via qrun.
>>>
>>> I could not find anything that might be related to this behaviour in
>>> the documentation (links in maui.cfg are broken btw); the showconfig
>>> command doesn't list any parameter with a value of 2, so that didn't
>>> help either.
>>>
>>> Any help/suggestions how to make Maui start more jobs per node are
>>> greatly appreciated.
>>>
>>> Best regards,
>>>
>>>     Nico van Eikema Hommes
>
> --
> Dr. N.J.R. van Eikema Hommes   Computer-Chemie-Centrum
> E-Mail:   nico.hommes at fau.de   Universitaet Erlangen-Nuernberg
> Phone:      +49-9131-8520402   Naegelsbachstr. 25
> FAX:        +49-9131-8520404   91052 Erlangen, Germany
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
