[torqueusers] pbs -l procs=n syntax defaults to 1

Glen Beane glen.beane at gmail.com
Wed Dec 18 11:00:08 MST 2013


I'm not exactly sure which version introduced the regression. I guess you
have a few choices: go back to a version of Torque that works as you
expect (maybe the last 2.x release), use a scheduler that understands
procs (or modify a scheduler to do so), or wait for Torque to be fixed.
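
For anyone who can change the submission (Kevin notes below that ANSYS RSM
cannot), the workaround Ken suggests further down is to ask for nodes/ppn
instead of procs. A minimal sketch, assuming 16-core nodes like the ones in
Kevin's nodes file, so the 32-core request becomes two whole nodes (not
strictly equivalent to procs=32, which lets the scheduler pack the cores
anywhere):

    #!/bin/bash
    #PBS -S /bin/bash
    #PBS -l nodes=2:ppn=16
    #PBS -j oe

    # On 16-core nodes this asks for the same 32 slots as "-l procs=32",
    # just pinned to two whole nodes instead of packed wherever they fit.
    cat $PBS_NODEFILE            # should list 32 hostnames, 16 per node
    wc -l < $PBS_NODEFILE        # quick count of the allocated slots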


On Wed, Dec 18, 2013 at 12:49 PM, Kevin Sutherland <
sutherland.kevinr at gmail.com> wrote:

> Wait, so do we have to go back to version 2.5.0 to get this to work
> correctly or has it broken between 2.5.0 and 4.2.6.1?
>
>
> On Wed, Dec 18, 2013 at 10:44 AM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> Well, there it is. It looks like we have a regression.
>>
>>
>> On Wed, Dec 18, 2013 at 10:40 AM, Glen Beane <glen.beane at gmail.com> wrote:
>>
>>> Torque 2.5.0 CHANGELOG entry:
>>>
>>>  e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
>>>      TORQUE simply recorded the value of x for procs in Resources_List. It
>>>      now takes that value and allocates x processors packed on any available
>>>      node. (Ken Nielson Adaptive Computing. June 17, 2010)
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Dec 18, 2013 at 12:33 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>>
>>>> Ken,
>>>>
>>>> this seems like a regression then.
>>>>
>>>>
>>>> On Wed, Dec 18, 2013 at 10:57 AM, Ken Nielson <
>>>> knielson at adaptivecomputing.com> wrote:
>>>>
>>>>> I am flip-flopping back to what I originally said about pass-through
>>>>> directives: procs is a pass-through, and TORQUE ignores it. It will let
>>>>> you submit the job, but when you qrun it you will get one node with one
>>>>> core to run the job. The scheduler is the one that interprets the meaning
>>>>> of procs; in the case of Moab it means "give me x cores anywhere you can."
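>>>>>
>>>>> A quick way to see this pass-through behavior for yourself (a sketch
>>>>> only: test.sh is a placeholder name, and the aux path is the one Kevin
>>>>> gives below, though the exact file name can vary): submit a procs-only
>>>>> job and qrun it by hand before the scheduler picks it up.
>>>>>
>>>>>     # Submit a job that only asks for procs; qsub prints the job id.
>>>>>     jobid=$(qsub -l procs=32 test.sh)
>>>>>
>>>>>     # Force it to run, bypassing Maui/pbs_sched entirely.
>>>>>     qrun "$jobid"
>>>>>
>>>>>     # See what pbs_server actually allocated.
>>>>>     qstat -f "$jobid" | grep -i exec_host
>>>>>
>>>>>     # On the mother superior the generated machine file has a single
>>>>>     # entry (one node, one core), exactly as described above.
>>>>>     wc -l /var/spool/torque/aux/"$jobid"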
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
>>>>> sutherland.kevinr at gmail.com> wrote:
>>>>>
>>>>>> To Ken's response... I would like to know how. The issue may be in our
>>>>>> queue config (attached in my first post); I am unsure how to get the
>>>>>> procs syntax working, and we had the issue without Maui in the mix as
>>>>>> well. Does anyone have a working setup with the procs syntax they could
>>>>>> walk through with me? Even copies of the config files with the pertinent
>>>>>> syntax described would help. We would really like to avoid going the
>>>>>> commercial route (HPC Suite from Adaptive Computing) if we can.
>>>>>>
>>>>>> Thanks,
>>>>>> -Kevin
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
>>>>>> knielson at adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Glen,
>>>>>>>
>>>>>>> You are right. My mistake. procs does work.
>>>>>>>
>>>>>>> Ken
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>>>>>>
>>>>>>>> I thought this had been fixed and procs had been made a real
>>>>>>>> resource in Torque (meaning it works as expected with qrun or pbs_sched).
>>>>>>>> I think the problem here is Maui.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>>>>>>> knielson at adaptivecomputing.com> wrote:
>>>>>>>>
>>>>>>>>> Kevin,
>>>>>>>>>
>>>>>>>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts
>>>>>>>>> it only so that it can hand it to the scheduler, and it is the
>>>>>>>>> scheduler that interprets the request. Depending on how qmgr is
>>>>>>>>> configured, the default TORQUE itself gives a job is one node with
>>>>>>>>> just one proc.
>>>>>>>>>
>>>>>>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>>>>>>> interpret the meaning of procs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>>>>>>> sutherland.kevinr at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> I have posted this on both the torque and maui user boards, as I am
>>>>>>>>>> unsure whether the issue is in maui or torque (although we had this
>>>>>>>>>> same problem before we ran maui).
>>>>>>>>>>
>>>>>>>>>> I am configuring a cluster for engineering simulation use at my
>>>>>>>>>> office. We have two clusters: one with 12 nodes and 16 processors per
>>>>>>>>>> node, and the other a 5 node cluster with 16 processors per node,
>>>>>>>>>> except for a bigmem machine with 32 processors.
>>>>>>>>>>
>>>>>>>>>> I am only working on the 5 node cluster at this time, but the
>>>>>>>>>> behavior I am dealing with occurs on both clusters. When the procs
>>>>>>>>>> syntax is used, the system defaults to 1 process, even though procs
>>>>>>>>>> is > 1. All nodes show free when issuing qnodes or pbsnodes -a and
>>>>>>>>>> list the appropriate number of cpus defined in the nodes file.
>>>>>>>>>>
>>>>>>>>>> I have a simple test script:
>>>>>>>>>>
>>>>>>>>>> #!/bin/bash
>>>>>>>>>>
>>>>>>>>>> #PBS -S /bin/bash
>>>>>>>>>> #PBS -l nodes=2:ppn=8
>>>>>>>>>> #PBS -j oe
>>>>>>>>>>
>>>>>>>>>> cat $PBS_NODEFILE
>>>>>>>>>>
>>>>>>>>>> This script prints out:
>>>>>>>>>>
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>> amdfr1.am1.mnet
>>>>>>>>>>
>>>>>>>>>> Which is expected. When I change the PBS resource list to:
>>>>>>>>>>
>>>>>>>>>> #PBS -l procs=32
>>>>>>>>>>
>>>>>>>>>> I get the following:
>>>>>>>>>>
>>>>>>>>>> pegasus.am1.mnet
>>>>>>>>>>
>>>>>>>>>> The machine file created in /var/spool/torque/aux simply has one
>>>>>>>>>> entry for one process, even though I requested 32. We have a piece of
>>>>>>>>>> simulation software that REQUIRES the use of the "-l procs=n" syntax
>>>>>>>>>> to function on the cluster (ANSYS does not plan to permit changes to
>>>>>>>>>> this until Release 16 in 2015). We are trying to use our cluster with
>>>>>>>>>> ANSYS RSM with CFX and Fluent.
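>>>>>>>>>>
>>>>>>>>>> For reference, the mismatch also shows up straight from qstat (a
>>>>>>>>>> sketch; test.sh stands in for the test script above and the job id is
>>>>>>>>>> whatever qsub returns): Resource_List records the procs=32 request,
>>>>>>>>>> while exec_host shows the single slot that was actually handed out.
>>>>>>>>>>
>>>>>>>>>>     jobid=$(qsub -l procs=32 test.sh)
>>>>>>>>>>     qstat -f "$jobid" | grep -E 'Resource_List|exec_host'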
>>>>>>>>>>
>>>>>>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>>>>>>
>>>>>>>>>> My queue and server attributes are defined as follows:
>>>>>>>>>>
>>>>>>>>>> #
>>>>>>>>>> # Create queues and set their attributes.
>>>>>>>>>> #
>>>>>>>>>> #
>>>>>>>>>> # Create and define queue batch
>>>>>>>>>> #
>>>>>>>>>> create queue batch
>>>>>>>>>> set queue batch queue_type = Execution
>>>>>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>>>>>> set queue batch enabled = True
>>>>>>>>>> set queue batch started = True
>>>>>>>>>> #
>>>>>>>>>> # Set server attributes.
>>>>>>>>>> #
>>>>>>>>>> set server scheduling = True
>>>>>>>>>> set server acl_hosts = titan1.am1.mnet
>>>>>>>>>> set server managers = kevin at titan1.am1.mnet
>>>>>>>>>> set server managers += root at titan1.am1.mnet
>>>>>>>>>> set server operators = kevin at titan1.am1.mnet
>>>>>>>>>> set server operators += root at titan1.am1.mnet
>>>>>>>>>> set server default_queue = batch
>>>>>>>>>> set server log_events = 511
>>>>>>>>>> set server mail_from = adm
>>>>>>>>>> set server scheduler_iteration = 600
>>>>>>>>>> set server node_check_rate = 150
>>>>>>>>>> set server tcp_timeout = 300
>>>>>>>>>> set server job_stat_rate = 45
>>>>>>>>>> set server poll_jobs = True
>>>>>>>>>> set server mom_job_sync = True
>>>>>>>>>> set server keep_completed = 300
>>>>>>>>>> set server submit_hosts = titan1.am1.mnet
>>>>>>>>>> set server next_job_number = 8
>>>>>>>>>> set server moab_array_compatible = True
>>>>>>>>>> set server nppcu = 1
>>>>>>>>>>
>>>>>>>>>> My torque nodes file is:
>>>>>>>>>>
>>>>>>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>>>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>>>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>>>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>>>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>>>>>>
>>>>>>>>>> Our maui.cfg file is:
>>>>>>>>>>
>>>>>>>>>> # maui.cfg 3.3.1
>>>>>>>>>>
>>>>>>>>>> SERVERHOST            titan1.am1.mnet
>>>>>>>>>> # primary admin must be first in list
>>>>>>>>>> ADMIN1                root kevin
>>>>>>>>>> ADMIN3              ALL
>>>>>>>>>>
>>>>>>>>>> # Resource Manager Definition
>>>>>>>>>>
>>>>>>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>>>>>>
>>>>>>>>>> # Allocation Manager Definition
>>>>>>>>>>
>>>>>>>>>> AMCFG[bank]  TYPE=NONE
>>>>>>>>>>
>>>>>>>>>> # full parameter docs at
>>>>>>>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>>>>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>>>>>>
>>>>>>>>>> RMPOLLINTERVAL        00:00:30
>>>>>>>>>>
>>>>>>>>>> SERVERPORT            42559
>>>>>>>>>> SERVERMODE            NORMAL
>>>>>>>>>>
>>>>>>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> LOGFILE               maui.log
>>>>>>>>>> LOGFILEMAXSIZE        10000000
>>>>>>>>>> LOGLEVEL              3
>>>>>>>>>>
>>>>>>>>>> # Job Priority:
>>>>>>>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>>>>>>
>>>>>>>>>> QUEUETIMEWEIGHT       1
>>>>>>>>>>
>>>>>>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>>>>>>
>>>>>>>>>> #FSPOLICY              PSDEDICATED
>>>>>>>>>> #FSDEPTH               7
>>>>>>>>>> #FSINTERVAL            86400
>>>>>>>>>> #FSDECAY               0.80
>>>>>>>>>>
>>>>>>>>>> # Throttling Policies:
>>>>>>>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>>>>>>
>>>>>>>>>> # NONE SPECIFIED
>>>>>>>>>>
>>>>>>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>>>>>>
>>>>>>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>>>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>>>>>>
>>>>>>>>>> # Node Allocation:
>>>>>>>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>>>>>>
>>>>>>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>>>>>>
>>>>>>>>>> # Kevin's Modifications:
>>>>>>>>>>
>>>>>>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>>>>>>
>>>>>>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>>>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>>>>>>
>>>>>>>>>> # Standing Reservations:
>>>>>>>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>>>>>>
>>>>>>>>>> # SRSTARTTIME[test] 8:00:00
>>>>>>>>>> # SRENDTIME[test]   17:00:00
>>>>>>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>>>>>>> # SRTASKCOUNT[test] 20
>>>>>>>>>> # SRMAXTIME[test]   0:30:00
>>>>>>>>>>
>>>>>>>>>> # Creds:
>>>>>>>>>> http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>>>>>>
>>>>>>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>>>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>>>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>>>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>>>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>>>>>>>
>>>>>>>>>> Our MOM config file is:
>>>>>>>>>>
>>>>>>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>>>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>>>>>>> $usecp        *:/home/kevin /home/kevin
>>>>>>>>>> $usecp        *:/home /home
>>>>>>>>>> $usecp        *:/root /root
>>>>>>>>>> $usecp        *:/home/mpi /home/mpi
>>>>>>>>>> $tmpdir        /home/mpi/tmp
>>>>>>>>>>
>>>>>>>>>> I am finding it difficult to identify the configuration issue. I
>>>>>>>>>> thought this thread would help:
>>>>>>>>>>
>>>>>>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>>>>>>
>>>>>>>>>> but in their examples the machine file is built correctly and they
>>>>>>>>>> are battling memory allocation instead. I can't even get that far
>>>>>>>>>> yet. Any thoughts?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Kevin Sutherland
>>>>>>>>>> Simulations Specialist
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ken Nielson
>>>>>>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>>>>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>>>>>>> www.adaptivecomputing.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kevin Sutherland
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
> --
> Kevin Sutherland
>