[torqueusers] pbs -l procs=n syntax defaults to 1

Ken Nielson knielson at adaptivecomputing.com
Wed Dec 18 10:44:03 MST 2013


Well, there it is. It looks like we have a regression.


On Wed, Dec 18, 2013 at 10:40 AM, Glen Beane <glen.beane at gmail.com> wrote:

> Torque 2.5.0 CHANGELOG entry:
>
>  e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
>      TORQUE simply recorded the value of x for procs in Resources_List. It
>      now takes that value and allocates x processors packed on any available
>      node. (Ken Nielson, Adaptive Computing. June 17, 2010)
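>
> In other words, with that change in place a procs request should be packed
> onto whatever free cores exist. A minimal sketch of the expected behavior
> (assuming two free 16-core nodes, as in the cluster described below):
>
>   #PBS -l procs=32
>   # expected: $PBS_NODEFILE contains 32 entries, e.g. 16 from each of two nodes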
>
>
>
>
>
> On Wed, Dec 18, 2013 at 12:33 PM, Glen Beane <glen.beane at gmail.com> wrote:
>
>> Ken,
>>
>> this seems like a regression then.
>>
>>
>> On Wed, Dec 18, 2013 at 10:57 AM, Ken Nielson <
>> knielson at adaptivecomputing.com> wrote:
>>
>>> I am flip-flopping back to what I originally said about pass-through
>>> directives: procs is a pass-through. TORQUE ignores it. It will allow you
>>> to submit the job, but when you do a qrun you will get one node with one
>>> core to run the job. The scheduler is the one that interprets the meaning
>>> of procs; in the case of Moab it means "give me x cores anywhere you can."
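>>>
>>> A rough sketch of that behavior from the command line (the job ID is a
>>> placeholder, and qrun needs operator privileges):
>>>
>>>   echo 'cat $PBS_NODEFILE' | qsub -l procs=32
>>>   qstat -f <jobid> | grep -i procs   # the value is recorded in Resource_List
>>>   qrun <jobid>                       # without Moab, the job gets 1 node / 1 core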
>>>
>>>
>>>
>>>
>>> On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
>>> sutherland.kevinr at gmail.com> wrote:
>>>
>>>> To Ken's response... I would like to know how. The issue may be in our
>>>> queue config (I attached this in my first post). I am unsure how to get
>>>> the procs syntax working; we had the issue without Maui in the mix as
>>>> well. Does anyone have a working setup with the procs syntax they could
>>>> walk through with me? Even copies of the config files with the pertinent
>>>> syntax described would help. We would really like to avoid going the
>>>> commercial route (HPC Suite from Adaptive Computing) if we can.
>>>>
>>>> Thanks,
>>>> -Kevin
>>>>
>>>>
>>>> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
>>>> knielson at adaptivecomputing.com> wrote:
>>>>
>>>>> Glen,
>>>>>
>>>>> You are right. My mistake. procs does work.
>>>>>
>>>>> Ken
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com>wrote:
>>>>>
>>>>>> I thought this had been fixed and procs had been made a real resource
>>>>>> in Torque (meaning it works as expected with qrun or pbs_sched).  I think
>>>>>> the problem here is Maui.
>>>>>>
>>>>>>
>>>>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>>>>> knielson at adaptivecomputing.com> wrote:
>>>>>>
>>>>>>> Kevin,
>>>>>>>
>>>>>>> procs is a pass-through resource for TORQUE. That is, TORQUE only
>>>>>>> accepts it so that it can hand it to the scheduler, and the scheduler
>>>>>>> interprets the request. Depending on how you have qmgr configured, the
>>>>>>> default TORQUE gives a job is one node with just one proc.
>>>>>>>
>>>>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>>>>> interpret the meaning of procs.
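>>>>>>>
>>>>>>> For example, the 32-core request below could be written explicitly (a
>>>>>>> sketch, assuming two free 16-core nodes) as
>>>>>>>
>>>>>>>   #PBS -l nodes=2:ppn=16
>>>>>>>
>>>>>>> which TORQUE allocates on its own, without relying on the scheduler to
>>>>>>> interpret procs.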
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>>>>> sutherland.kevinr at gmail.com> wrote:
>>>>>>>
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I have posted this on both the torque and maui user boards, as I am
>>>>>>>> unsure whether the issue is in maui or torque (although we had this
>>>>>>>> same problem before we ran maui).
>>>>>>>>
>>>>>>>> I am configuring a cluster for engineering simulation use at my
>>>>>>>> office. We have two clusters: one with 12 nodes of 16 processors each,
>>>>>>>> and a 5-node cluster with 16 processors per node except for a bigmem
>>>>>>>> machine with 32 processors.
>>>>>>>>
>>>>>>>> I am only working on the 5-node cluster at this time, but the behavior
>>>>>>>> I am dealing with occurs on both clusters. When the procs syntax is
>>>>>>>> used, the system defaults to 1 process, even though procs is > 1. All
>>>>>>>> nodes show free when issuing qnodes or pbsnodes -a, and they list the
>>>>>>>> appropriate number of cpus defined in the nodes file.
>>>>>>>>
>>>>>>>> I have a simple test script:
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>>
>>>>>>>> #PBS -S /bin/bash
>>>>>>>> #PBS -l nodes=2:ppn=8
>>>>>>>> #PBS -j oe
>>>>>>>>
>>>>>>>> cat $PBS_NODEFILE
>>>>>>>>
>>>>>>>> This script prints out:
>>>>>>>>
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> pegasus.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>> amdfr1.am1.mnet
>>>>>>>>
>>>>>>>> Which is expected. When I change the PBS resource list to:
>>>>>>>>
>>>>>>>> #PBS -l procs=32
>>>>>>>>
>>>>>>>> I get the following:
>>>>>>>>
>>>>>>>> pegasus.am1.mnet
>>>>>>>>
>>>>>>>> The machine file created in /var/spool/torque/aux simply has 1 entry
>>>>>>>> for 1 process, even though I requested 32. We have a piece of
>>>>>>>> simulation software that REQUIRES the "-l procs=n" syntax to function
>>>>>>>> on the cluster (ANSYS does not plan to permit changes to this until
>>>>>>>> Release 16 in 2015). We are trying to use our cluster with Ansys RSM
>>>>>>>> with CFX and Fluent.
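>>>>>>>>
>>>>>>>> (For reference, the checks below, with the job ID as a placeholder,
>>>>>>>> show what pbs_server recorded for the job and how many tasks Maui
>>>>>>>> thinks it needs.)
>>>>>>>>
>>>>>>>>   qstat -f <jobid> | grep -i resource_list   # is procs=32 recorded?
>>>>>>>>   checkjob <jobid>                           # does Maui see 32 tasks or 1?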
>>>>>>>>
>>>>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>>>>
>>>>>>>> My queue and server attributes are defined as follows:
>>>>>>>>
>>>>>>>> #
>>>>>>>> # Create queues and set their attributes.
>>>>>>>> #
>>>>>>>> #
>>>>>>>> # Create and define queue batch
>>>>>>>> #
>>>>>>>> create queue batch
>>>>>>>> set queue batch queue_type = Execution
>>>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>>>> set queue batch enabled = True
>>>>>>>> set queue batch started = True
>>>>>>>> #
>>>>>>>> # Set server attributes.
>>>>>>>> #
>>>>>>>> set server scheduling = True
>>>>>>>> set server acl_hosts = titan1.am1.mnet
>>>>>>>> set server managers = kevin at titan1.am1.mnet
>>>>>>>> set server managers += root at titan1.am1.mnet
>>>>>>>> set server operators = kevin at titan1.am1.mnet
>>>>>>>> set server operators += root at titan1.am1.mnet
>>>>>>>> set server default_queue = batch
>>>>>>>> set server log_events = 511
>>>>>>>> set server mail_from = adm
>>>>>>>> set server scheduler_iteration = 600
>>>>>>>> set server node_check_rate = 150
>>>>>>>> set server tcp_timeout = 300
>>>>>>>> set server job_stat_rate = 45
>>>>>>>> set server poll_jobs = True
>>>>>>>> set server mom_job_sync = True
>>>>>>>> set server keep_completed = 300
>>>>>>>> set server submit_hosts = titan1.am1.mnet
>>>>>>>> set server next_job_number = 8
>>>>>>>> set server moab_array_compatible = True
>>>>>>>> set server nppcu = 1
>>>>>>>>
>>>>>>>> My torque nodes file is:
>>>>>>>>
>>>>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>>>>
>>>>>>>> Our maui.cfg file is:
>>>>>>>>
>>>>>>>> # maui.cfg 3.3.1
>>>>>>>>
>>>>>>>> SERVERHOST            titan1.am1.mnet
>>>>>>>> # primary admin must be first in list
>>>>>>>> ADMIN1                root kevin
>>>>>>>> ADMIN3              ALL
>>>>>>>>
>>>>>>>> # Resource Manager Definition
>>>>>>>>
>>>>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>>>>
>>>>>>>> # Allocation Manager Definition
>>>>>>>>
>>>>>>>> AMCFG[bank]  TYPE=NONE
>>>>>>>>
>>>>>>>> # full parameter docs at
>>>>>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>>>>
>>>>>>>> RMPOLLINTERVAL        00:00:30
>>>>>>>>
>>>>>>>> SERVERPORT            42559
>>>>>>>> SERVERMODE            NORMAL
>>>>>>>>
>>>>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>>>>
>>>>>>>>
>>>>>>>> LOGFILE               maui.log
>>>>>>>> LOGFILEMAXSIZE        10000000
>>>>>>>> LOGLEVEL              3
>>>>>>>>
>>>>>>>> # Job Priority:
>>>>>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>>>>
>>>>>>>> QUEUETIMEWEIGHT       1
>>>>>>>>
>>>>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>>>>
>>>>>>>> #FSPOLICY              PSDEDICATED
>>>>>>>> #FSDEPTH               7
>>>>>>>> #FSINTERVAL            86400
>>>>>>>> #FSDECAY               0.80
>>>>>>>>
>>>>>>>> # Throttling Policies:
>>>>>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>>>>
>>>>>>>> # NONE SPECIFIED
>>>>>>>>
>>>>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>>>>
>>>>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>>>>
>>>>>>>> # Node Allocation:
>>>>>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>>>>
>>>>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>>>>
>>>>>>>> # Kevin's Modifications:
>>>>>>>>
>>>>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>>>>
>>>>>>>>
>>>>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>>>>
>>>>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>>>>
>>>>>>>> # Standing Reservations:
>>>>>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>>>>
>>>>>>>> # SRSTARTTIME[test] 8:00:00
>>>>>>>> # SRENDTIME[test]   17:00:00
>>>>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>>>>> # SRTASKCOUNT[test] 20
>>>>>>>> # SRMAXTIME[test]   0:30:00
>>>>>>>>
>>>>>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>>>>
>>>>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>>>>>
>>>>>>>> Our MOM config file is:
>>>>>>>>
>>>>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>>>>> $usecp        *:/home/kevin /home/kevin
>>>>>>>> $usecp        *:/home /home
>>>>>>>> $usecp        *:/root /root
>>>>>>>> $usecp        *:/home/mpi /home/mpi
>>>>>>>> $tmpdir        /home/mpi/tmp
>>>>>>>>
>>>>>>>> I am finding it difficult to identify the configuration issue. I
>>>>>>>> thought this thread would help:
>>>>>>>>
>>>>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>>>>
>>>>>>>> but their examples show that the machine file is working correctly
>>>>>>>> and they are battling memory allocations instead. I can't seem to get
>>>>>>>> that far yet. Any thoughts?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Kevin Sutherland
>>>>>>>> Simulations Specialist
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Sutherland
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com