[torqueusers] pbs -l procs=n syntax defaults to 1

Glen Beane glen.beane at gmail.com
Wed Dec 18 10:40:59 MST 2013


Torque 2.5.0 CHANGELOG entry:

 e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
     TORQUE simply recorded the value of x for procs in Resources_List. It
     now takes that value and allocates x processors packed on any available
     node. (Ken Nielson, Adaptive Computing. June 17, 2010)
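
For reference, the request the changelog is describing is the bare procs
spec. A minimal test script (a sketch only; the count of 32 simply matches
the example discussed further down the thread) would be:

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=32
#PBS -j oe

# Per the changelog, this should list 32 slots packed onto whatever
# nodes are available.
cat $PBS_NODEFILE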





On Wed, Dec 18, 2013 at 12:33 PM, Glen Beane <glen.beane at gmail.com> wrote:

> Ken,
>
> This seems like a regression, then.
>
>
> On Wed, Dec 18, 2013 at 10:57 AM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> I am flip-flopping. Back to what I originally said about pass-through
>> directives: procs is a pass-through. TORQUE ignores it. It will allow you
>> to submit the job, but when you do a qrun you will get one node with one
>> core to run the job. The scheduler is the one that interprets the meaning
>> of procs. In the case of Moab it means "give me x cores anywhere you can."
>>
>>
>>
>>
>> On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
>> sutherland.kevinr at gmail.com> wrote:
>>
>>> To Ken's response...I would like to know how? The issue may be in our
>>> queue config (I attached this in my first post). I am unsure how to get the
>>> procs syntax working. We had the issue without Maui in the mix as well.
>>> Does anyone have a working setup with the procs syntax they could walk
>>> through with me? Even if it's just copies of the config files with the
>>> pertinent syntax described...we would really like to avoid going the
>>> commercial route (HPC Suite from Adaptive Computing) if we can.
>>>
>>> Thanks,
>>> -Kevin
>>>
>>>
>>> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
>>> knielson at adaptivecomputing.com> wrote:
>>>
>>>> Glen,
>>>>
>>>> You are right. My mistake. procs does work.
>>>>
>>>> Ken
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>>>
>>>>> I thought this had been fixed and procs had been made a real resource
>>>>> in Torque (meaning it works as expected with qrun or pbs_sched).  I think
>>>>> the problem here is Maui.
>>>>>
>>>>>
>>>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>>>> knielson at adaptivecomputing.com> wrote:
>>>>>
>>>>>> Kevin,
>>>>>>
>>>>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts
>>>>>> it only so it can hand it to the scheduler, and the scheduler
>>>>>> interprets it. Depending on how you have qmgr configured, the TORQUE
>>>>>> default for a job is one node with just one processor.
>>>>>>
>>>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>>>> interpret the meaning of procs.
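>>>>>>
>>>>>> As a rough sketch of that alternative (the node and ppn counts below
>>>>>> are only an illustration based on the np=16 nodes in this thread, not
>>>>>> a tested recipe), the two request forms would look like:
>>>>>>
>>>>>> #PBS -l procs=32          # pass-through; the scheduler decides placement
>>>>>> #PBS -l nodes=2:ppn=16    # explicit to TORQUE: two nodes, 16 cores each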
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>>>> sutherland.kevinr at gmail.com> wrote:
>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> I have posted this on both the torque and maui user boards, as I am
>>>>>>> unsure whether the issue is in maui or torque (although we had this
>>>>>>> same problem before we ran maui).
>>>>>>>
>>>>>>> I am configuring a cluster for engineering simulation use at my
>>>>>>> office. We have two clusters: one with 12 nodes and 16 processors per
>>>>>>> node, and the other a 5-node cluster with 16 processors per node,
>>>>>>> except for a bigmem machine with 32 processors.
>>>>>>>
>>>>>>> I am only working on the 5-node cluster at this time, but the
>>>>>>> behavior I am dealing with appears on both clusters. When the procs
>>>>>>> syntax is used, the system defaults to 1 process, even though procs
>>>>>>> is greater than 1. All nodes show free when issuing qnodes or
>>>>>>> pbsnodes -a, and they list the appropriate number of cpus defined in
>>>>>>> the nodes file.
>>>>>>>
>>>>>>> I have a simple test script:
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>>
>>>>>>> #PBS -S /bin/bash
>>>>>>> #PBS -l nodes=2:ppn=8
>>>>>>> #PBS -j oe
>>>>>>>
>>>>>>> cat $PBS_NODEFILE
>>>>>>>
>>>>>>> This script prints out:
>>>>>>>
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> pegasus.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>> amdfr1.am1.mnet
>>>>>>>
>>>>>>> Which is expected. When I change the PBS resource list to:
>>>>>>>
>>>>>>> #PBS -l procs=32
>>>>>>>
>>>>>>> I get the following:
>>>>>>>
>>>>>>> pegasus.am1.mnet
>>>>>>>
>>>>>>> The machine file created in /var/spool/torque/aux simply has 1
>>>>>>> entry for 1 process, even though I requested 32. We have a piece of
>>>>>>> simulation software that REQUIRES the use of the "-l procs=n" syntax
>>>>>>> to function on the cluster (ANSYS does not plan to permit changes to
>>>>>>> this until Release 16 in 2015). We are trying to use our cluster with
>>>>>>> ANSYS RSM with CFX and Fluent.
>>>>>>>
>>>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>>>
>>>>>>> My queue and server attributes are defined as follows:
>>>>>>>
>>>>>>> #
>>>>>>> # Create queues and set their attributes.
>>>>>>> #
>>>>>>> #
>>>>>>> # Create and define queue batch
>>>>>>> #
>>>>>>> create queue batch
>>>>>>> set queue batch queue_type = Execution
>>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>>> set queue batch enabled = True
>>>>>>> set queue batch started = True
>>>>>>> #
>>>>>>> # Set server attributes.
>>>>>>> #
>>>>>>> set server scheduling = True
>>>>>>> set server acl_hosts = titan1.am1.mnet
>>>>>>> set server managers = kevin at titan1.am1.mnet
>>>>>>> set server managers += root at titan1.am1.mnet
>>>>>>> set server operators = kevin at titan1.am1.mnet
>>>>>>> set server operators += root at titan1.am1.mnet
>>>>>>> set server default_queue = batch
>>>>>>> set server log_events = 511
>>>>>>> set server mail_from = adm
>>>>>>> set server scheduler_iteration = 600
>>>>>>> set server node_check_rate = 150
>>>>>>> set server tcp_timeout = 300
>>>>>>> set server job_stat_rate = 45
>>>>>>> set server poll_jobs = True
>>>>>>> set server mom_job_sync = True
>>>>>>> set server keep_completed = 300
>>>>>>> set server submit_hosts = titan1.am1.mnet
>>>>>>> set server next_job_number = 8
>>>>>>> set server moab_array_compatible = True
>>>>>>> set server nppcu = 1
>>>>>>>
>>>>>>> My torque nodes file is:
>>>>>>>
>>>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>>>
>>>>>>> Our maui.cfg file is:
>>>>>>>
>>>>>>> # maui.cfg 3.3.1
>>>>>>>
>>>>>>> SERVERHOST            titan1.am1.mnet
>>>>>>> # primary admin must be first in list
>>>>>>> ADMIN1                root kevin
>>>>>>> ADMIN3              ALL
>>>>>>>
>>>>>>> # Resource Manager Definition
>>>>>>>
>>>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>>>
>>>>>>> # Allocation Manager Definition
>>>>>>>
>>>>>>> AMCFG[bank]  TYPE=NONE
>>>>>>>
>>>>>>> # full parameter docs at
>>>>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>>>
>>>>>>> RMPOLLINTERVAL        00:00:30
>>>>>>>
>>>>>>> SERVERPORT            42559
>>>>>>> SERVERMODE            NORMAL
>>>>>>>
>>>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>>>
>>>>>>>
>>>>>>> LOGFILE               maui.log
>>>>>>> LOGFILEMAXSIZE        10000000
>>>>>>> LOGLEVEL              3
>>>>>>>
>>>>>>> # Job Priority:
>>>>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>>>
>>>>>>> QUEUETIMEWEIGHT       1
>>>>>>>
>>>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>>>
>>>>>>> #FSPOLICY              PSDEDICATED
>>>>>>> #FSDEPTH               7
>>>>>>> #FSINTERVAL            86400
>>>>>>> #FSDECAY               0.80
>>>>>>>
>>>>>>> # Throttling Policies:
>>>>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>>>
>>>>>>> # NONE SPECIFIED
>>>>>>>
>>>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>>>
>>>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>>>
>>>>>>> # Node Allocation:
>>>>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>>>
>>>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>>>
>>>>>>> # Kevin's Modifications:
>>>>>>>
>>>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>>>
>>>>>>>
>>>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>>>
>>>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>>>
>>>>>>> # Standing Reservations:
>>>>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>>>
>>>>>>> # SRSTARTTIME[test] 8:00:00
>>>>>>> # SRENDTIME[test]   17:00:00
>>>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>>>> # SRTASKCOUNT[test] 20
>>>>>>> # SRMAXTIME[test]   0:30:00
>>>>>>>
>>>>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>>>
>>>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>>>>
>>>>>>> Our MOM config file is:
>>>>>>>
>>>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>>>> $usecp        *:/home/kevin /home/kevin
>>>>>>> $usecp        *:/home /home
>>>>>>> $usecp        *:/root /root
>>>>>>> $usecp        *:/home/mpi /home/mpi
>>>>>>> $tmpdir        /home/mpi/tmp
>>>>>>>
>>>>>>> I am finding it difficult to identify the configuration issue. I
>>>>>>> thought this thread would help:
>>>>>>>
>>>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>>>
>>>>>>> but their examples show the machine file is working correctly and
>>>>>>> they are battling memory allocations. I can't seem to get that far yet. Any
>>>>>>> thoughts?
>>>>>>>
>>>>>>> --
>>>>>>> Kevin Sutherland
>>>>>>> Simulations Specialist
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ken Nielson
>>>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>>>> www.adaptivecomputing.com
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ken Nielson
>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>> www.adaptivecomputing.com
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Kevin Sutherland
>>>
>>>
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>>
>>
>