[torqueusers] pbs -l procs=n syntax defaults to 1

Glen Beane glen.beane at gmail.com
Mon Dec 16 15:00:03 MST 2013


I believe Maui "reinterprets" the resource request and then sends whatever
hostlist it wants when it tells pbs_server to execute the job.  Maui doesn't
seem to process procs=X and defaults to 1.  I think you would get the
behavior you want with either pbs_sched or Moab.
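
One quick way to confirm it is Maui rather than TORQUE (a rough sketch, so
treat the resource list and job id as placeholders): stop the maui daemon,
submit a procs job, and force pbs_server to place it itself with qrun as a
TORQUE manager:

echo 'wc -l < $PBS_NODEFILE' | qsub -l procs=32,walltime=00:05:00
qrun <jobid>    # <jobid> is whatever qsub printed

If the job's output shows 32 nodefile entries under qrun, TORQUE's procs
handling is fine and the problem is in Maui's interpretation.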


On Mon, Dec 16, 2013 at 4:46 PM, Kevin Sutherland <
sutherland.kevinr at gmail.com> wrote:

> I was under the same impression as Glen. When we use the -l nodes=n
> syntax, the system runs at most 5 processes, which is the total number of
> nodes in my nodes file, completely ignoring the np (processor) counts
> configured for the nodes.
>
> I removed/unset the resources_default.nodes value in qmgr and restarted
> the server, as some folks said it can conflict with the scheduler's
> allocation. This made no difference.
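>
> (For reference, roughly what I ran, reconstructed from memory:
>
> qmgr -c "unset server resources_default.nodes"
> qterm -t quick
> pbs_server
> )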
>
> I am aware that TORQUE applies a default CPU value of 1 when no value is
> given; however, with the np designations in the nodes file and the -l
> procs=n values being passed, why is TORQUE still ignoring them?
>
>
> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:
>
>> I thought this had been fixed and procs had been made a real resource in
>> Torque (meaning it works as expected with qrun or pbs_sched).  I think the
>> problem here is Maui.
>>
>>
>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>> knielson at adaptivecomputing.com> wrote:
>>
>>> Kevin,
>>>
>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
>>> only so it can hand it to the scheduler, and the scheduler interprets the
>>> request. Unless you have configured something different in qmgr, TORQUE
>>> itself defaults a job to one node with just one proc.
>>>
>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>> interpret the meaning of procs.
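>>>
>>> For example, on your nodes a 32-way request could be spelled out
>>> explicitly (just a sketch; how you split it across nodes is up to you,
>>> and "job.sh" stands for your submit script):
>>>
>>> qsub -l nodes=2:ppn=16 job.sh    # two of the 16-core nodes
>>> qsub -l nodes=1:ppn=32 job.sh    # or the 32-core node by itself
>>>
>>> As your nodes=2:ppn=8 test shows, that form produces the machine file
>>> you expect without relying on the scheduler to expand procs.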
>>>
>>>
>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>> sutherland.kevinr at gmail.com> wrote:
>>>
>>>> Greetings,
>>>>
>>>> I have posted this on both the TORQUE and Maui user lists, as I am
>>>> unsure whether the issue is in Maui or TORQUE (although we had this
>>>> same problem before we ran Maui).
>>>>
>>>> I am configuring a cluster for engineering simulation use at my office.
>>>> We have two clusters: one with 12 nodes and 16 processors per node, and
>>>> a 5-node cluster with 16 processors per node except for one bigmem
>>>> machine with 32 processors.
>>>>
>>>> I am only working on the 5-node cluster at this time, but the behavior
>>>> occurs on both clusters. When the procs syntax is used, the system
>>>> defaults to 1 process, even though procs is > 1. All nodes show free
>>>> when issuing qnodes or pbsnodes -a, and they list the appropriate number
>>>> of cpus defined in the nodes file.
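>>>>
>>>> (A quick spot check, for what it's worth:
>>>>
>>>> pbsnodes -a | grep -E '^[a-z]|state =|np ='
>>>>
>>>> lists each node as free with the np value from the nodes file.)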
>>>>
>>>> I have a simple test script:
>>>>
>>>> #!/bin/bash
>>>>
>>>> #PBS -S /bin/bash
>>>> #PBS -l nodes=2:ppn=8
>>>> #PBS -j oe
>>>>
>>>> cat $PBS_NODEFILE
>>>>
>>>> This script prints out:
>>>>
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> pegasus.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>> amdfr1.am1.mnet
>>>>
>>>> Which is expected. When I change the PBS resource list to:
>>>>
>>>> #PBS -l procs=32
>>>>
>>>> I get the following:
>>>>
>>>> pegasus.am1.mnet
>>>>
>>>> The machine file created in /var/spool/torque/aux simply has 1 entry
>>>> for 1 process, even though I requested 32. We have a piece of simulation
>>>> software that REQUIRES the use of the "-l procs=n" syntax to function on
>>>> the cluster (ANSYS does not plan to permit changes to this until Release
>>>> 16 in 2015). We are trying to use our cluster with ANSYS RSM together
>>>> with CFX and Fluent.
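>>>>
>>>> (A quick in-job check, roughly what I have been using to confirm the
>>>> count:
>>>>
>>>> echo "nodefile entries: $(wc -l < $PBS_NODEFILE)"
>>>>
>>>> With -l procs=32 it reports 1; with -l nodes=2:ppn=8 it reports 16,
>>>> matching the listings above.)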
>>>>
>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>
>>>> My queue and server attributes are defined as follows:
>>>>
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch resources_default.walltime = 01:00:00
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server acl_hosts = titan1.am1.mnet
>>>> set server managers = kevin at titan1.am1.mnet
>>>> set server managers += root at titan1.am1.mnet
>>>> set server operators = kevin at titan1.am1.mnet
>>>> set server operators += root at titan1.am1.mnet
>>>> set server default_queue = batch
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server scheduler_iteration = 600
>>>> set server node_check_rate = 150
>>>> set server tcp_timeout = 300
>>>> set server job_stat_rate = 45
>>>> set server poll_jobs = True
>>>> set server mom_job_sync = True
>>>> set server keep_completed = 300
>>>> set server submit_hosts = titan1.am1.mnet
>>>> set server next_job_number = 8
>>>> set server moab_array_compatible = True
>>>> set server nppcu = 1
>>>>
>>>> My torque nodes file is:
>>>>
>>>> titan1.am1.mnet np=16 RAM64GB
>>>> titan2.am1.mnet np=16 RAM64GB
>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>
>>>> Our maui.cfg file is:
>>>>
>>>> # maui.cfg 3.3.1
>>>>
>>>> SERVERHOST            titan1.am1.mnet
>>>> # primary admin must be first in list
>>>> ADMIN1                root kevin
>>>> ADMIN3              ALL
>>>>
>>>> # Resource Manager Definition
>>>>
>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>
>>>> # Allocation Manager Definition
>>>>
>>>> AMCFG[bank]  TYPE=NONE
>>>>
>>>> # full parameter docs at
>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>> # use the 'schedctl -l' command to display current configuration
>>>>
>>>> RMPOLLINTERVAL        00:00:30
>>>>
>>>> SERVERPORT            42559
>>>> SERVERMODE            NORMAL
>>>>
>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>
>>>>
>>>> LOGFILE               maui.log
>>>> LOGFILEMAXSIZE        10000000
>>>> LOGLEVEL              3
>>>>
>>>> # Job Priority:
>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>
>>>> QUEUETIMEWEIGHT       1
>>>>
>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>
>>>> #FSPOLICY              PSDEDICATED
>>>> #FSDEPTH               7
>>>> #FSINTERVAL            86400
>>>> #FSDECAY               0.80
>>>>
>>>> # Throttling Policies:
>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>
>>>> # NONE SPECIFIED
>>>>
>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>
>>>> BACKFILLPOLICY        FIRSTFIT
>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>
>>>> # Node Allocation:
>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>
>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>
>>>> # Kevin's Modifications:
>>>>
>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>
>>>>
>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>
>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>
>>>> # Standing Reservations:
>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>
>>>> # SRSTARTTIME[test] 8:00:00
>>>> # SRENDTIME[test]   17:00:00
>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>> # SRTASKCOUNT[test] 20
>>>> # SRMAXTIME[test]   0:30:00
>>>>
>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>
>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>
>>>> Our MOM config file is:
>>>>
>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>> $clienthost    10.0.0.10    # IP address of management node
>>>> $usecp        *:/home/kevin /home/kevin
>>>> $usecp        *:/home /home
>>>> $usecp        *:/root /root
>>>> $usecp        *:/home/mpi /home/mpi
>>>> $tmpdir        /home/mpi/tmp
>>>>
>>>> I am finding it difficult to identify the configuration issue. I
>>>> thought this thread would help:
>>>>
>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>
>>>> but their examples show the machine file is working correctly and they
>>>> are battling memory allocations. I can't seem to get that far yet. Any
>>>> thoughts?
>>>>
>>>> --
>>>> Kevin Sutherland
>>>> Simulations Specialist
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ken Nielson
>>> +1 801.717.3700 office +1 801.717.3738 fax
>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>> www.adaptivecomputing.com
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Kevin Sutherland
>
>
>