[torqueusers] pbs -l procs=n syntax defaults to 1

Glen Beane glen.beane at gmail.com
Mon Dec 16 13:26:36 MST 2013


I thought this had been fixed and procs had been made a real resource in
Torque (meaning it works as expected with qrun or pbs_sched).  I think the
problem here is Maui.
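
One quick way to confirm that (a sketch; the job id below is made up, and
qrun requires manager privileges on the pbs_server host) is to submit a
procs-only job and force it to run with qrun, bypassing the scheduler:

# submit a job that only requests procs
echo 'cat $PBS_NODEFILE' | qsub -l procs=4

# run it directly, without Maui (job id is hypothetical)
qrun 123.titan1.am1.mnet

If the resulting nodefile has 4 entries, TORQUE is honoring procs and the
problem is on the Maui side.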


On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <knielson at adaptivecomputing.com> wrote:

> Kevin,
>
> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
> only so that it can hand it to the scheduler, and the scheduler interprets
> the request. Depending on how qmgr is configured, TORQUE's default
> allocation for a job is one node with just one proc.
>
> You could use -l nodes=x instead. Otherwise, it is up to Maui to interpret
> the meaning of procs.
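>
> For example (numbers are hypothetical, based on the 16-core nodes in your
> nodes file), these two requests both ask for 32 processors, but only the
> second one is interpreted by TORQUE itself:
>
> #PBS -l procs=32          # passed through; the scheduler decides placement
> #PBS -l nodes=2:ppn=16    # handled by TORQUE: 2 nodes, 16 procs per node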
>
>
> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <sutherland.kevinr at gmail.com> wrote:
>
>> Greetings,
>>
>> I have posted this on both the torque and maui user lists, as I am unsure
>> whether the issue is in Maui or TORQUE (although we had this same problem
>> before we ran Maui).
>>
>> I am configuring a cluster for engineering simulation use at my office.
>> We have two clusters: one with 12 nodes of 16 processors each, and a
>> 5-node cluster with 16 processors per node, except for one bigmem machine
>> with 32 processors.
>>
>> I am only working on the 5-node cluster at this time, but the behavior I
>> am dealing with occurs on both clusters. When the procs syntax is used,
>> the job defaults to a single processor, even though procs > 1. All nodes
>> show free when I issue qnodes or pbsnodes -a, and they list the
>> appropriate number of CPUs defined in the nodes file.
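>>
>> For reference, the check looks like this (the grep is only there to trim
>> the output):
>>
>> pbsnodes -a | grep -E 'state = |np = '
>>
>> Every node reports state = free, with np = 16 (np = 32 on pegasus).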
>>
>> I have a simple test script:
>>
>> #!/bin/bash
>>
>> #PBS -S /bin/bash
>> #PBS -l nodes=2:ppn=8
>> #PBS -j oe
>>
>> cat $PBS_NODEFILE
>>
>> This script prints out:
>>
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> pegasus.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>> amdfr1.am1.mnet
>>
>> This is the expected output. When I change the PBS resource list to:
>>
>> #PBS -l procs=32
>>
>> I get the following:
>>
>> pegasus.am1.mnet
>>
>> The machine file created in /var/spool/torque/aux simply has one entry
>> for one process, even though I requested 32. We have a piece of simulation
>> software that REQUIRES the use of the "-l procs=n" syntax to function on
>> the cluster (ANSYS does not plan to permit changes to this until Release
>> 16 in 2015). We are trying to use our cluster with ANSYS RSM, CFX, and
>> Fluent.
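>>
>> In case anyone else is stuck on the same ANSYS limitation, the fallback I
>> am considering is a qsub submit filter that rewrites procs into nodes/ppn
>> before TORQUE sees it. This is an untested sketch: it assumes homogeneous
>> 16-core nodes, and that SUBMITFILTER in torque.cfg points at the script.
>>
>> #!/bin/bash
>> # qsub submit filter: the job script arrives on stdin; the (possibly
>> # rewritten) script must be written to stdout
>> PPN=16   # assumption: homogeneous 16-core nodes
>> while IFS= read -r line; do
>>     if [[ $line =~ ^#PBS[[:space:]]+-l[[:space:]]+procs=([0-9]+)[[:space:]]*$ ]]; then
>>         procs=${BASH_REMATCH[1]}
>>         nodes=$(( (procs + PPN - 1) / PPN ))   # round up to whole nodes
>>         echo "#PBS -l nodes=${nodes}:ppn=${PPN}"
>>     else
>>         printf '%s\n' "$line"
>>     fi
>> done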
>>
>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>
>> My queue and server attributes are defined as follows:
>>
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue batch
>> #
>> create queue batch
>> set queue batch queue_type = Execution
>> set queue batch resources_default.walltime = 01:00:00
>> set queue batch enabled = True
>> set queue batch started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server acl_hosts = titan1.am1.mnet
>> set server managers = kevin at titan1.am1.mnet
>> set server managers += root at titan1.am1.mnet
>> set server operators = kevin at titan1.am1.mnet
>> set server operators += root at titan1.am1.mnet
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 300
>> set server job_stat_rate = 45
>> set server poll_jobs = True
>> set server mom_job_sync = True
>> set server keep_completed = 300
>> set server submit_hosts = titan1.am1.mnet
>> set server next_job_number = 8
>> set server moab_array_compatible = True
>> set server nppcu = 1
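>>
>> Note that the only default set anywhere above is
>> resources_default.walltime. A quick way to double-check that no hidden
>> default is forcing one node (standard qmgr print commands, nothing
>> cluster-specific):
>>
>> qmgr -c 'print server' | grep resources_default
>> qmgr -c 'print queue batch' | grep resources_default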
>>
>> My torque nodes file is:
>>
>> titan1.am1.mnet np=16 RAM64GB
>> titan2.am1.mnet np=16 RAM64GB
>> amdfl1.am1.mnet np=16 RAM64GB
>> amdfr1.am1.mnet np=16 RAM64GB
>> pegasus.am1.mnet np=32 RAM128GB
>>
>> Our maui.cfg file is:
>>
>> # maui.cfg 3.3.1
>>
>> SERVERHOST            titan1.am1.mnet
>> # primary admin must be first in list
>> ADMIN1                root kevin
>> ADMIN3              ALL
>>
>> # Resource Manager Definition
>>
>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>
>> # Allocation Manager Definition
>>
>> AMCFG[bank]  TYPE=NONE
>>
>> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
>> # use the 'schedctl -l' command to display current configuration
>>
>> RMPOLLINTERVAL        00:00:30
>>
>> SERVERPORT            42559
>> SERVERMODE            NORMAL
>>
>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>
>>
>> LOGFILE               maui.log
>> LOGFILEMAXSIZE        10000000
>> LOGLEVEL              3
>>
>> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
>>
>> QUEUETIMEWEIGHT       1
>>
>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>
>> #FSPOLICY              PSDEDICATED
>> #FSDEPTH               7
>> #FSINTERVAL            86400
>> #FSDECAY               0.80
>>
>> # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>
>> # NONE SPECIFIED
>>
>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>
>> BACKFILLPOLICY        FIRSTFIT
>> RESERVATIONPOLICY     CURRENTHIGHEST
>>
>> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
>>
>> NODEALLOCATIONPOLICY  MINRESOURCE
>>
>> # Kevin's Modifications:
>>
>> JOBNODEMATCHPOLICY EXACTNODE
>>
>>
>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>
>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>
>> # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>
>> # SRSTARTTIME[test] 8:00:00
>> # SRENDTIME[test]   17:00:00
>> # SRDAYS[test]      MON TUE WED THU FRI
>> # SRTASKCOUNT[test] 20
>> # SRMAXTIME[test]   0:30:00
>>
>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>
>> # USERCFG[DEFAULT]      FSTARGET=25.0
>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>
>> Our MOM config file is:
>>
>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>> $clienthost    10.0.0.10    # IP address of management node
>> $usecp        *:/home/kevin /home/kevin
>> $usecp        *:/home /home
>> $usecp        *:/root /root
>> $usecp        *:/home/mpi /home/mpi
>> $tmpdir        /home/mpi/tmp
>>
>> I am finding it difficult to identify the configuration issue. I thought
>> this thread would help:
>>
>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>
>> but their examples show the machine file is working correctly and they
>> are battling memory allocations. I can't seem to get that far yet. Any
>> thoughts?
>>
>> --
>> Kevin Sutherland
>> Simulations Specialist
>>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>