[torqueusers] pbs -l procs=n syntax defaults to 1

Glen Beane glen.beane at gmail.com
Wed Dec 18 10:33:14 MST 2013


Ken,

This seems like a regression, then.


On Wed, Dec 18, 2013 at 10:57 AM, Ken Nielson <
knielson at adaptivecomputing.com> wrote:

> I am flip-flopping; back to what I originally said about pass-through
> directives. procs is a pass-through, and TORQUE ignores it. It will let you
> submit the job, but when you do a qrun you will get one node with one core
> to run the job. The scheduler is the one that interprets the meaning of
> procs; in the case of Moab it means "give me x cores anywhere you can."
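>
> (A quick way to see what pbs_server itself records for such a request; a
> minimal sketch, where the job id variable and the sleep job are only
> illustrative:)
>
> # submit a job that only requests procs, then inspect what the server stored
> jobid=$(echo "sleep 60" | qsub -l procs=32)
> qstat -f "$jobid" | grep -E 'Resource_List|exec_host'
>
> If the scheduler is the one expanding procs, Resource_List.procs should show
> 32 while the placement (exec_host) stays at the single-node default until the
> scheduler actually runs the job.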
>
>
>
>
> On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
> sutherland.kevinr at gmail.com> wrote:
>
>> To Ken's response: I would like to know how. The issue may be in our
>> queue config (I attached it in my first post); I am unsure how to get the
>> procs syntax working, and we had the issue without Maui in the mix as well.
>> Does anyone have a working setup with the procs syntax they could walk me
>> through? Even copies of the config files with the pertinent syntax
>> described would help. We would really like to avoid going the commercial
>> route (the HPC Suite from Adaptive Computing) if we can.
>>
>> Thanks,
>> -Kevin
>>
>>
>> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
>> knielson at adaptivecomputing.com> wrote:
>>
>>> Glen,
>>>
>>> You are right. My mistake. procs does work.
>>>
>>> Ken
>>>
>>>
>>>
>>>
>>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com>wrote:
>>>
>>>> I thought this had been fixed and procs had been made a real resource
>>>> in Torque (meaning it works as expected with qrun or pbs_sched).  I think
>>>> the problem here is Maui.
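>>>>
>>>> (One way to test that in isolation; a minimal sketch, where the script
>>>> name and job id variable are only illustrative:)
>>>>
>>>> qmgr -c 'set server scheduling = False'  # keep the scheduler from taking the job
>>>> jobid=$(qsub -l procs=32 test.sh)
>>>> qrun "$jobid"                            # let pbs_server place the job itself
>>>> qmgr -c 'set server scheduling = True'
>>>>
>>>> If $PBS_NODEFILE then has 32 entries, TORQUE is honoring procs natively
>>>> and the problem is on the Maui side; if it still has a single entry,
>>>> TORQUE itself is dropping the request before Maui ever sees it.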
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>>> knielson at adaptivecomputing.com> wrote:
>>>>
>>>>> Kevin,
>>>>>
>>>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
>>>>> only so it can hand it to the scheduler, and it is the scheduler that
>>>>> interprets the request. Unless qmgr is configured otherwise, the default
>>>>> TORQUE gives a job is one node with just one proc.
>>>>>
>>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>>> interpret the meaning of procs.
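>>>>>
>>>>> (For example, a minimal sketch of the equivalent whole-node request on
>>>>> your 16-core nodes:)
>>>>>
>>>>> #PBS -l nodes=2:ppn=16
>>>>>
>>>>> rather than
>>>>>
>>>>> #PBS -l procs=32
>>>>>
>>>>> The first asks for two specific whole nodes; the second asks the
>>>>> scheduler for 32 cores placed anywhere it can find them.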
>>>>>
>>>>>
>>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>>> sutherland.kevinr at gmail.com> wrote:
>>>>>
>>>>>> Greetings,
>>>>>>
>>>>>> I have posted this on both the torque and maui user boards, as I am
>>>>>> unsure whether the issue is in Maui or TORQUE (we had this same problem
>>>>>> before we ran Maui).
>>>>>>
>>>>>> I am configuring a cluster for engineering simulation use at my
>>>>>> office. We have two clusters: one with 12 nodes of 16 processors each,
>>>>>> and a 5-node cluster with 16 processors per node, except for one bigmem
>>>>>> machine with 32 processors.
>>>>>>
>>>>>> I am only working on the 5-node cluster at this time, but the behavior
>>>>>> I am dealing with shows up on both clusters. When the procs syntax is
>>>>>> used, the job defaults to 1 process even though procs is > 1. All nodes
>>>>>> show free when issuing qnodes or pbsnodes -a, and they list the
>>>>>> appropriate number of cpus defined in the nodes file.
>>>>>>
>>>>>> I have a simple test script:
>>>>>>
>>>>>> #!/bin/bash
>>>>>>
>>>>>> #PBS -S /bin/bash
>>>>>> #PBS -l nodes=2:ppn=8
>>>>>> #PBS -j oe
>>>>>>
>>>>>> cat $PBS_NODEFILE
>>>>>>
>>>>>> This script prints out:
>>>>>>
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> pegasus.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>> amdfr1.am1.mnet
>>>>>>
>>>>>> Which is expected. When I change the PBS resource list to:
>>>>>>
>>>>>> #PBS -l procs=32
>>>>>>
>>>>>> I get the following:
>>>>>>
>>>>>> pegasus.am1.mnet
>>>>>>
>>>>>> The machine file created in /var/spool/torque/aux simply has one entry
>>>>>> for one process, even though I requested 32. We have a piece of
>>>>>> simulation software that REQUIRES the use of the "-l procs=n" syntax to
>>>>>> function on the cluster (ANSYS does not plan to permit changes to this
>>>>>> until Release 16 in 2015). We are trying to use our cluster with ANSYS
>>>>>> RSM for CFX and Fluent.
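>>>>>>
>>>>>> (To illustrate why the one-entry node file matters, here is a minimal
>>>>>> sketch of how a job script would typically size an MPI launch from
>>>>>> $PBS_NODEFILE; the mpirun line and solver name are placeholders, not
>>>>>> our actual launch command:)
>>>>>>
>>>>>> NPROCS=$(wc -l < "$PBS_NODEFILE")   # number of allocated slots
>>>>>> echo "allocated slots: $NPROCS"
>>>>>> # mpirun -np "$NPROCS" -machinefile "$PBS_NODEFILE" ./solver
>>>>>>
>>>>>> With -l procs=32 that count comes back as 1, so anything launched this
>>>>>> way ends up on a single core.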
>>>>>>
>>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>>
>>>>>> My queue and server attributes are defined as follows:
>>>>>>
>>>>>> #
>>>>>> # Create queues and set their attributes.
>>>>>> #
>>>>>> #
>>>>>> # Create and define queue batch
>>>>>> #
>>>>>> create queue batch
>>>>>> set queue batch queue_type = Execution
>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>> set queue batch enabled = True
>>>>>> set queue batch started = True
>>>>>> #
>>>>>> # Set server attributes.
>>>>>> #
>>>>>> set server scheduling = True
>>>>>> set server acl_hosts = titan1.am1.mnet
>>>>>> set server managers = kevin at titan1.am1.mnet
>>>>>> set server managers += root at titan1.am1.mnet
>>>>>> set server operators = kevin at titan1.am1.mnet
>>>>>> set server operators += root at titan1.am1.mnet
>>>>>> set server default_queue = batch
>>>>>> set server log_events = 511
>>>>>> set server mail_from = adm
>>>>>> set server scheduler_iteration = 600
>>>>>> set server node_check_rate = 150
>>>>>> set server tcp_timeout = 300
>>>>>> set server job_stat_rate = 45
>>>>>> set server poll_jobs = True
>>>>>> set server mom_job_sync = True
>>>>>> set server keep_completed = 300
>>>>>> set server submit_hosts = titan1.am1.mnet
>>>>>> set server next_job_number = 8
>>>>>> set server moab_array_compatible = True
>>>>>> set server nppcu = 1
>>>>>>
>>>>>> My torque nodes file is:
>>>>>>
>>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>>
>>>>>> Our maui.cfg file is:
>>>>>>
>>>>>> # maui.cfg 3.3.1
>>>>>>
>>>>>> SERVERHOST            titan1.am1.mnet
>>>>>> # primary admin must be first in list
>>>>>> ADMIN1                root kevin
>>>>>> ADMIN3              ALL
>>>>>>
>>>>>> # Resource Manager Definition
>>>>>>
>>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>>
>>>>>> # Allocation Manager Definition
>>>>>>
>>>>>> AMCFG[bank]  TYPE=NONE
>>>>>>
>>>>>> # full parameter docs at
>>>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>>
>>>>>> RMPOLLINTERVAL        00:00:30
>>>>>>
>>>>>> SERVERPORT            42559
>>>>>> SERVERMODE            NORMAL
>>>>>>
>>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>>
>>>>>>
>>>>>> LOGFILE               maui.log
>>>>>> LOGFILEMAXSIZE        10000000
>>>>>> LOGLEVEL              3
>>>>>>
>>>>>> # Job Priority:
>>>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>>
>>>>>> QUEUETIMEWEIGHT       1
>>>>>>
>>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>>
>>>>>> #FSPOLICY              PSDEDICATED
>>>>>> #FSDEPTH               7
>>>>>> #FSINTERVAL            86400
>>>>>> #FSDECAY               0.80
>>>>>>
>>>>>> # Throttling Policies:
>>>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>>
>>>>>> # NONE SPECIFIED
>>>>>>
>>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>>
>>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>>
>>>>>> # Node Allocation:
>>>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>>
>>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>>
>>>>>> # Kevin's Modifications:
>>>>>>
>>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>>
>>>>>>
>>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>>
>>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>>
>>>>>> # Standing Reservations:
>>>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>>
>>>>>> # SRSTARTTIME[test] 8:00:00
>>>>>> # SRENDTIME[test]   17:00:00
>>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>>> # SRTASKCOUNT[test] 20
>>>>>> # SRMAXTIME[test]   0:30:00
>>>>>>
>>>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>>
>>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
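>>>>>>
>>>>>> (For reference, a minimal sketch of how Maui's view of a procs request
>>>>>> can be inspected; the script name and job id variable are only
>>>>>> illustrative:)
>>>>>>
>>>>>> jobid=$(qsub -l procs=32 test.sh)
>>>>>> checkjob -v "$jobid"   # the output should show the task count Maui derived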
>>>>>>
>>>>>> Our MOM config file is:
>>>>>>
>>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>>> $usecp        *:/home/kevin /home/kevin
>>>>>> $usecp        *:/home /home
>>>>>> $usecp        *:/root /root
>>>>>> $usecp        *:/home/mpi /home/mpi
>>>>>> $tmpdir        /home/mpi/tmp
>>>>>>
>>>>>> I am finding it difficult to identify the configuration issue. I
>>>>>> thought this thread would help:
>>>>>>
>>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>>
>>>>>> but their examples show the machine file is working correctly and
>>>>>> they are battling memory allocations. I can't seem to get that far yet. Any
>>>>>> thoughts?
>>>>>>
>>>>>> --
>>>>>> Kevin Sutherland
>>>>>> Simulations Specialist
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ken Nielson
>>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>>> www.adaptivecomputing.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ken Nielson
>>> +1 801.717.3700 office +1 801.717.3738 fax
>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>> www.adaptivecomputing.com
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Kevin Sutherland
>>
>>
>>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>
>
>