[torqueusers] pbs -l procs=n syntax defaults to 1

Ken Nielson knielson at adaptivecomputing.com
Wed Dec 18 08:57:33 MST 2013


I am flip-flopping. Back to what I originally said about pass-through
directives: procs is a pass-through resource. TORQUE ignores it. It will allow
you to submit the job, but when you do a qrun you will get one node with one
core to run the job. The scheduler is the one that interprets the meaning of
procs. In the case of Moab it means "give me x cores anywhere you can."
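
A rough sketch of what I mean (the job id below is made up, and test.sh just
stands in for whatever script is being submitted):

# Submit the job; TORQUE accepts procs but does not act on it itself.
qsub -l procs=32 test.sh

# Start it by hand with no scheduler involved: the job gets one node with
# one core, so $PBS_NODEFILE ends up with a single entry.
qrun 123.titan1.am1.mnet

# Under Moab the same request is read as "give me 32 cores anywhere", and
# the exec host list may be spread across several nodes accordingly.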




On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
sutherland.kevinr at gmail.com> wrote:

> To Ken's response: I would like to know how. The issue may be in our
> queue config (I attached this in my first post). I am unsure how to get the
> procs syntax working. We had the issue without Maui in the mix as well.
> Does anyone have a working setup with the procs syntax they could walk
> through with me? Even if it's just copies of the config files with the
> pertinent syntax described. We would really like to avoid going the
> commercial route (HPC Suite from Adaptive Computing) if we can.
>
> Thanks,
> -Kevin
>
>
> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> Glen,
>>
>> You are right. My mistake. procs does work.
>>
>> Ken
>>
>>
>>
>>
>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>
>>> I thought this had been fixed and procs had been made a real resource in
>>> Torque (meaning it works as expected with qrun or pbs_sched).  I think the
>>> problem here is Maui.
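>>>
>>> One way to check that (just a sketch; the job id is a placeholder) is to
>>> submit the procs request and start it by hand so only TORQUE is involved:
>>>
>>> qsub -l procs=32 test.sh
>>> qrun <jobid>
>>> # then cat $PBS_NODEFILE inside the job: 32 entries would mean TORQUE
>>> # honors procs on its own, while a single entry would point at TORQUE
>>> # itself rather than Maui.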
>>>
>>>
>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>> knielson at adaptivecomputing.com> wrote:
>>>
>>>> Kevin,
>>>>
>>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
>>>> only so it can hand it to the scheduler, and the scheduler interprets the
>>>> request. Depending on how you have qmgr configured, the default from
>>>> TORQUE is one node with just one proc for the job.
>>>>
>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>> interpret the meaning of procs.
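>>>>
>>>> For example (a sketch only; adjust to your node layout), an explicit
>>>> request for 32 cores on your 16-core nodes could be written as:
>>>>
>>>> #PBS -l nodes=2:ppn=16
>>>>
>>>> or, to keep it on the single 32-core machine:
>>>>
>>>> #PBS -l nodes=1:ppn=32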
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>> sutherland.kevinr at gmail.com> wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> I have posted this on both the torque and maui user boards, as I am unsure
>>>>> whether the issue is in Maui or TORQUE (although we had this same problem
>>>>> before we ran Maui).
>>>>>
>>>>> I am configuring a cluster for engineering simulation use at my
>>>>> office. We have two clusters: one with 12 nodes and 16 processors per node,
>>>>> and a 5-node cluster with 16 processors per node, except for a bigmem
>>>>> machine with 32 processors.
>>>>>
>>>>> I am only working on the 5-node cluster at this time, but the behavior
>>>>> I am dealing with occurs on both clusters. When the procs syntax is used, the
>>>>> system defaults to 1 process, even though procs is > 1. All nodes show
>>>>> free when issuing qnodes or pbsnodes -a, and they list the appropriate number
>>>>> of cpus defined in the nodes file.
>>>>>
>>>>> I have a simple test script:
>>>>>
>>>>> #!/bin/bash
>>>>>
>>>>> #PBS -S /bin/bash
>>>>> #PBS -l nodes=2:ppn=8
>>>>> #PBS -j oe
>>>>>
>>>>> cat $PBS_NODEFILE
>>>>>
>>>>> This script prints out:
>>>>>
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>>
>>>>> Which is expected. When I change the PBS resource list to:
>>>>>
>>>>> #PBS -l procs=32
>>>>>
>>>>> I get the following:
>>>>>
>>>>> pegasus.am1.mnet
>>>>>
>>>>> The machine file created in /var/spool/torque/aux simply has 1 entry
>>>>> for 1 process, even though I requested 32. We have a piece of simulation
>>>>> software that REQUIRES the use of the "-l procs=n" syntax to function on
>>>>> the cluster. (ANSYS does not plan to permit changes to this until Release
>>>>> 16 in 2015.) We are trying to use our cluster with ANSYS RSM with CFX and
>>>>> Fluent.
>>>>>
>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>
>>>>> My queue and server attributes are defined as follows:
>>>>>
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = titan1.am1.mnet
>>>>> set server managers = kevin at titan1.am1.mnet
>>>>> set server managers += root at titan1.am1.mnet
>>>>> set server operators = kevin at titan1.am1.mnet
>>>>> set server operators += root at titan1.am1.mnet
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 300
>>>>> set server job_stat_rate = 45
>>>>> set server poll_jobs = True
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server submit_hosts = titan1.am1.mnet
>>>>> set server next_job_number = 8
>>>>> set server moab_array_compatible = True
>>>>> set server nppcu = 1
>>>>>
>>>>> My torque nodes file is:
>>>>>
>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>
>>>>> Our maui.cfg file is:
>>>>>
>>>>> # maui.cfg 3.3.1
>>>>>
>>>>> SERVERHOST            titan1.am1.mnet
>>>>> # primary admin must be first in list
>>>>> ADMIN1                root kevin
>>>>> ADMIN3              ALL
>>>>>
>>>>> # Resource Manager Definition
>>>>>
>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>
>>>>> # Allocation Manager Definition
>>>>>
>>>>> AMCFG[bank]  TYPE=NONE
>>>>>
>>>>> # full parameter docs at
>>>>> http://supercluster.org/mauidocs/a.fparameters.html
>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>
>>>>> RMPOLLINTERVAL        00:00:30
>>>>>
>>>>> SERVERPORT            42559
>>>>> SERVERMODE            NORMAL
>>>>>
>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>
>>>>>
>>>>> LOGFILE               maui.log
>>>>> LOGFILEMAXSIZE        10000000
>>>>> LOGLEVEL              3
>>>>>
>>>>> # Job Priority:
>>>>> http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>
>>>>> QUEUETIMEWEIGHT       1
>>>>>
>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>
>>>>> #FSPOLICY              PSDEDICATED
>>>>> #FSDEPTH               7
>>>>> #FSINTERVAL            86400
>>>>> #FSDECAY               0.80
>>>>>
>>>>> # Throttling Policies:
>>>>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>
>>>>> # NONE SPECIFIED
>>>>>
>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>
>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>
>>>>> # Node Allocation:
>>>>> http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>
>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>
>>>>> # Kevin's Modifications:
>>>>>
>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>
>>>>>
>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>
>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>
>>>>> # Standing Reservations:
>>>>> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>
>>>>> # SRSTARTTIME[test] 8:00:00
>>>>> # SRENDTIME[test]   17:00:00
>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>> # SRTASKCOUNT[test] 20
>>>>> # SRMAXTIME[test]   0:30:00
>>>>>
>>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>
>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>>
>>>>> Our MOM config file is:
>>>>>
>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>> $usecp        *:/home/kevin /home/kevin
>>>>> $usecp        *:/home /home
>>>>> $usecp        *:/root /root
>>>>> $usecp        *:/home/mpi /home/mpi
>>>>> $tmpdir        /home/mpi/tmp
>>>>>
>>>>> I am finding it difficult to identify the configuration issue. I
>>>>> thought this thread would help:
>>>>>
>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>
>>>>> but their examples show the machine file is working correctly and they
>>>>> are battling memory allocations. I can't seem to get that far yet. Any
>>>>> thoughts?
>>>>>
>>>>> --
>>>>> Kevin Sutherland
>>>>> Simulations Specialist
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ken Nielson
>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>> www.adaptivecomputing.com
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>>
>>
>
>
> --
> Kevin Sutherland
>
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com