[torqueusers] pbs -l procs=n syntax defaults to 1

Kevin Sutherland sutherland.kevinr at gmail.com
Mon Dec 16 14:46:57 MST 2013


I was under the same impression as Glen. When we use the -l nodes=n syntax,
the system only runs UP TO 5 processes, which is the total number of nodes in
my nodes file; it completely ignores the processor (np) counts configured on
the nodes.

I removed/unset the resources_default.nodes value in qmgr and restarted the
server, as some folks said it can conflict with the scheduler's allocation.
This made no difference.
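
For the record, the unset looked roughly like this (the restart step may
differ on your system):

qmgr -c 'unset server resources_default.nodes'
qterm -t quick
pbs_server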

I am aware that TORQUE applies a default CPU value of 1 when no value is
given. However, with the np designations in the nodes file and the -l procs=n
values being passed, why does TORQUE still ignore them?
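
For what it is worth, this is how I check what TORQUE records for a job (the
job id below is just an example):

qstat -f 8.titan1.am1.mnet | grep -E 'Resource_List|exec_host'
# Resource_List.procs = 32
# exec_host = pegasus.am1.mnet/0   <- a single slot, despite procs=32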


On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:

> I thought this had been fixed and procs had been made a real resource in
> Torque (meaning it works as expected with qrun or pbs_sched).  I think the
> problem here is Maui.
>
>
> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> Kevin,
>>
>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
>> only so it can hand it off to the scheduler, and the scheduler interprets
>> the request. Depending on how you have qmgr configured, the default TORQUE
>> allocation for a job is one node with just one proc.
>>
>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>> interpret the meaning of procs.
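>>
>> For example (job.sh is just a placeholder script name; the layout TORQUE
>> picks for nodes=x:ppn=y depends on your nodes file):
>>
>> qsub -l nodes=2:ppn=16 job.sh   # 32 cores as 2 x 16, handled by TORQUE
>> qsub -l procs=32 job.sh         # 32 cores, meaning left to the scheduler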
>>
>>
>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>> sutherland.kevinr at gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> I have posted this on both the torque and maui user boards, as I am
>>> unsure whether the issue is in Maui or TORQUE (we had this same problem
>>> before we ran Maui).
>>>
>>> I am configuring a cluster for engineering simulation use at my office.
>>> We have two clusters: one with 12 nodes and 16 processors per node, and
>>> the other a 5-node cluster with 16 processors per node, except for a
>>> bigmem machine with 32 processors.
>>>
>>> I am only working on the 5-node cluster at this time, but the behavior I
>>> am dealing with occurs on both clusters. When the procs syntax is used,
>>> the system defaults to 1 process, even though procs > 1. All nodes show
>>> free when issuing qnodes or pbsnodes -a, and they list the appropriate
>>> number of cpus defined in the nodes file.
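>>>
>>> For reference, this is the kind of check I am running (output trimmed to
>>> the relevant fields; values reflect our nodes file):
>>>
>>> pbsnodes -a | grep -E 'state =|np ='
>>>      state = free
>>>      np = 16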
>>>
>>> I have a simple test script:
>>>
>>> #!/bin/bash
>>>
>>> #PBS -S /bin/bash
>>> #PBS -l nodes=2:ppn=8
>>> #PBS -j oe
>>>
>>> cat $PBS_NODEFILE
>>>
>>> This script prints out:
>>>
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> pegasus.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>> amdfr1.am1.mnet
>>>
>>> That is as expected. When I change the PBS resource list to:
>>>
>>> #PBS -l procs=32
>>>
>>> I get the following:
>>>
>>> pegasus.am1.mnet
>>>
>>> The machine file created in /var/spool/torque/aux has just 1 entry for 1
>>> process, even though I requested 32. We have a piece of simulation
>>> software that REQUIRES the use of the "-l procs=n" syntax to function on
>>> the cluster (ANSYS does not plan to permit changes to this until Release
>>> 16 in 2015). We are trying to use our cluster with ANSYS RSM with CFX and
>>> Fluent.
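>>>
>>> For completeness, this is how I am verifying the machine file from inside
>>> a job (the aux path is where our install writes it; $PBS_NODEFILE points
>>> at the same file):
>>>
>>> echo $PBS_NODEFILE               # /var/spool/torque/aux/<jobid>
>>> wc -l $PBS_NODEFILE              # prints 1, where I expected 32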
>>>
>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>
>>> My queue and server attributes are defined as follows:
>>>
>>> #
>>> # Create queues and set their attributes.
>>> #
>>> #
>>> # Create and define queue batch
>>> #
>>> create queue batch
>>> set queue batch queue_type = Execution
>>> set queue batch resources_default.walltime = 01:00:00
>>> set queue batch enabled = True
>>> set queue batch started = True
>>> #
>>> # Set server attributes.
>>> #
>>> set server scheduling = True
>>> set server acl_hosts = titan1.am1.mnet
>>> set server managers = kevin at titan1.am1.mnet
>>> set server managers += root at titan1.am1.mnet
>>> set server operators = kevin at titan1.am1.mnet
>>> set server operators += root at titan1.am1.mnet
>>> set server default_queue = batch
>>> set server log_events = 511
>>> set server mail_from = adm
>>> set server scheduler_iteration = 600
>>> set server node_check_rate = 150
>>> set server tcp_timeout = 300
>>> set server job_stat_rate = 45
>>> set server poll_jobs = True
>>> set server mom_job_sync = True
>>> set server keep_completed = 300
>>> set server submit_hosts = titan1.am1.mnet
>>> set server next_job_number = 8
>>> set server moab_array_compatible = True
>>> set server nppcu = 1
>>>
>>> My torque nodes file is:
>>>
>>> titan1.am1.mnet np=16 RAM64GB
>>> titan2.am1.mnet np=16 RAM64GB
>>> amdfl1.am1.mnet np=16 RAM64GB
>>> amdfr1.am1.mnet np=16 RAM64GB
>>> pegasus.am1.mnet np=32 RAM128GB
>>>
>>> Our maui.cfg file is:
>>>
>>> # maui.cfg 3.3.1
>>>
>>> SERVERHOST            titan1.am1.mnet
>>> # primary admin must be first in list
>>> ADMIN1                root kevin
>>> ADMIN3              ALL
>>>
>>> # Resource Manager Definition
>>>
>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>
>>> # Allocation Manager Definition
>>>
>>> AMCFG[bank]  TYPE=NONE
>>>
>>> # full parameter docs at
>>> http://supercluster.org/mauidocs/a.fparameters.html
>>> # use the 'schedctl -l' command to display current configuration
>>>
>>> RMPOLLINTERVAL        00:00:30
>>>
>>> SERVERPORT            42559
>>> SERVERMODE            NORMAL
>>>
>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>
>>>
>>> LOGFILE               maui.log
>>> LOGFILEMAXSIZE        10000000
>>> LOGLEVEL              3
>>>
>>> # Job Priority:
>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>
>>> QUEUETIMEWEIGHT       1
>>>
>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>
>>> #FSPOLICY              PSDEDICATED
>>> #FSDEPTH               7
>>> #FSINTERVAL            86400
>>> #FSDECAY               0.80
>>>
>>> # Throttling Policies:
>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>
>>> # NONE SPECIFIED
>>>
>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>
>>> BACKFILLPOLICY        FIRSTFIT
>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>
>>> # Node Allocation:
>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>
>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>
>>> # Kevin's Modifications:
>>>
>>> JOBNODEMATCHPOLICY EXACTNODE
>>>
>>>
>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>
>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>
>>> # Standing Reservations:
>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>
>>> # SRSTARTTIME[test] 8:00:00
>>> # SRENDTIME[test]   17:00:00
>>> # SRDAYS[test]      MON TUE WED THU FRI
>>> # SRTASKCOUNT[test] 20
>>> # SRMAXTIME[test]   0:30:00
>>>
>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>
>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>
>>> Our MOM config file is:
>>>
>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>> $clienthost    10.0.0.10    # IP address of management node
>>> $usecp        *:/home/kevin /home/kevin
>>> $usecp        *:/home /home
>>> $usecp        *:/root /root
>>> $usecp        *:/home/mpi /home/mpi
>>> $tmpdir        /home/mpi/tmp
>>>
>>> I am finding it difficult to identify the configuration issue. I thought
>>> this thread would help:
>>>
>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>
>>> but their examples show the machine file is working correctly and they
>>> are battling memory allocations. I can't seem to get that far yet. Any
>>> thoughts?
>>>
>>> --
>>> Kevin Sutherland
>>> Simulations Specialist
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>


-- 
Kevin Sutherland

