[torqueusers] pbs -l procs=n syntax defaults to 1

Ken Nielson knielson at adaptivecomputing.com
Wed Dec 18 08:52:12 MST 2013


Hi all,

As a possible clue to what is happening: procs works as expected
with Moab as the scheduler.
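
For reference, a minimal sketch mirroring Kevin's test script: submitted with
Moab scheduling, it comes back with a 32-entry $PBS_NODEFILE.

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=32
#PBS -j oe

# With Moab as the scheduler, this node file lists 32 host entries
# spread across whatever nodes have free cores.
cat $PBS_NODEFILE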




On Tue, Dec 17, 2013 at 4:53 PM, Kevin Sutherland <
sutherland.kevinr at gmail.com> wrote:

> To Ken's response: I would like to know how. The issue may be in our
> queue config (I attached this in my first post). I am unsure how to get the
> procs syntax working. We had the issue without Maui in the mix as well.
> Does anyone have a working setup with the procs syntax they could walk me
> through? Even if it's just copies of the config files with the pertinent
> syntax described, we would really like to avoid going the commercial route
> (HPC Suite from Adaptive Computing) if we can.
>
> Thanks,
> -Kevin
>
>
> On Tue, Dec 17, 2013 at 10:14 AM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>> Glen,
>>
>> You are right. My mistake. procs does work.
>>
>> Ken
>>
>>
>>
>>
>> On Mon, Dec 16, 2013 at 1:26 PM, Glen Beane <glen.beane at gmail.com> wrote:
>>
>>> I thought this had been fixed and procs had been made a real resource in
>>> Torque (meaning it works as expected with qrun or pbs_sched).  I think the
>>> problem here is Maui.
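>>>
>>> A quick way to test that split (rough sketch; the job id is a placeholder,
>>> qrun has to be issued by root or another TORQUE operator, and Maui should
>>> be paused or stopped so it doesn't grab the job first):
>>>
>>> # Submit the same request Kevin is using, then bypass the scheduler
>>> # and ask pbs_server to start the job directly:
>>> echo 'cat $PBS_NODEFILE' | qsub -l procs=32
>>> qrun <job id>
>>> # If the job output then lists 32 node-file entries, TORQUE is handling
>>> # procs correctly and the single-entry result points at Maui.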
>>>
>>>
>>> On Mon, Dec 16, 2013 at 2:51 PM, Ken Nielson <
>>> knielson at adaptivecomputing.com> wrote:
>>>
>>>> Kevin,
>>>>
>>>> procs is a pass-through resource for TORQUE. That is, TORQUE accepts it
>>>> only so it can hand it to the scheduler, and the scheduler is what
>>>> interprets the request. Depending on how you have qmgr configured, the
>>>> TORQUE default for a job is one node with just one proc.
>>>>
>>>> You could use -l nodes=x instead. Otherwise, it is up to Maui to
>>>> interpret the meaning of procs.
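>>>>
>>>> For example, on your 16-core nodes a request like this (just a sketch,
>>>> not tied to any particular scheduler) asks for the same 32 cores without
>>>> relying on procs:
>>>>
>>>> #PBS -l nodes=2:ppn=16
>>>>
>>>> # The qmgr defaults I mentioned can be inspected with, e.g.:
>>>> qmgr -c "print server"
>>>> qmgr -c "list queue batch"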
>>>>
>>>>
>>>> On Mon, Dec 16, 2013 at 11:42 AM, Kevin Sutherland <
>>>> sutherland.kevinr at gmail.com> wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> I have posted this on both the torque and maui user boards as I am unsure
>>>>> whether the issue is in maui or torque (although we had this same problem
>>>>> before we ran maui).
>>>>>
>>>>> I am configuring a cluster for engineering simulation use at my
>>>>> office. We have two clusters: one with 12 nodes and 16 processors per
>>>>> node, and the other with 5 nodes and 16 processors per node, except for
>>>>> a bigmem machine with 32 processors.
>>>>>
>>>>> I am only working on the 5-node cluster at this time, but the behavior
>>>>> appears on both clusters. When the procs syntax is used, the system
>>>>> defaults to 1 processor, even though procs is > 1. All nodes show free
>>>>> when issuing qnodes or pbsnodes -a, and they list the appropriate number
>>>>> of cpus defined in the nodes file.
>>>>>
>>>>> I have a simple test script:
>>>>>
>>>>> #!/bin/bash
>>>>>
>>>>> #PBS -S /bin/bash
>>>>> #PBS -l nodes=2:ppn=8
>>>>> #PBS -j oe
>>>>>
>>>>> cat $PBS_NODEFILE
>>>>>
>>>>> This script prints out:
>>>>>
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> pegasus.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>> amdfr1.am1.mnet
>>>>>
>>>>> Which is expected. When I change the PBS resource list to:
>>>>>
>>>>> #PBS -l procs=32
>>>>>
>>>>> I get the following:
>>>>>
>>>>> pegasus.am1.mnet
>>>>>
>>>>> The machine file created in /var/spool/torque/aux simply has 1 entry
>>>>> for 1 process, even though I requested 32. We have a piece of simulation
>>>>> software that REQUIRES the use of the "-l procs=n" syntax to function on
>>>>> the cluster (ANSYS does not plan to permit changes to this until Release
>>>>> 16 in 2015). We are trying to use our cluster with ANSYS RSM with CFX and
>>>>> Fluent.
>>>>>
>>>>> We are running torque 4.2.6.1 and Maui 3.3.1.
>>>>>
>>>>> My queue and server attributes are defined as follows:
>>>>>
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = titan1.am1.mnet
>>>>> set server managers = kevin at titan1.am1.mnet
>>>>> set server managers += root at titan1.am1.mnet
>>>>> set server operators = kevin at titan1.am1.mnet
>>>>> set server operators += root at titan1.am1.mnet
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 300
>>>>> set server job_stat_rate = 45
>>>>> set server poll_jobs = True
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server submit_hosts = titan1.am1.mnet
>>>>> set server next_job_number = 8
>>>>> set server moab_array_compatible = True
>>>>> set server nppcu = 1
>>>>>
>>>>> My torque nodes file is:
>>>>>
>>>>> titan1.am1.mnet np=16 RAM64GB
>>>>> titan2.am1.mnet np=16 RAM64GB
>>>>> amdfl1.am1.mnet np=16 RAM64GB
>>>>> amdfr1.am1.mnet np=16 RAM64GB
>>>>> pegasus.am1.mnet np=32 RAM128GB
>>>>>
>>>>> Our maui.cfg file is:
>>>>>
>>>>> # maui.cfg 3.3.1
>>>>>
>>>>> SERVERHOST            titan1.am1.mnet
>>>>> # primary admin must be first in list
>>>>> ADMIN1                root kevin
>>>>> ADMIN3              ALL
>>>>>
>>>>> # Resource Manager Definition
>>>>>
>>>>> RMCFG[TITAN1.AM1.MNET] TYPE=PBS
>>>>>
>>>>> # Allocation Manager Definition
>>>>>
>>>>> AMCFG[bank]  TYPE=NONE
>>>>>
>>>>> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
>>>>> # use the 'schedctl -l' command to display current configuration
>>>>>
>>>>> RMPOLLINTERVAL        00:00:30
>>>>>
>>>>> SERVERPORT            42559
>>>>> SERVERMODE            NORMAL
>>>>>
>>>>> # Admin: http://supercluster.org/mauidocs/a.esecurity.html
>>>>>
>>>>>
>>>>> LOGFILE               maui.log
>>>>> LOGFILEMAXSIZE        10000000
>>>>> LOGLEVEL              3
>>>>>
>>>>> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html
>>>>>
>>>>> QUEUETIMEWEIGHT       1
>>>>>
>>>>> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html
>>>>>
>>>>> #FSPOLICY              PSDEDICATED
>>>>> #FSDEPTH               7
>>>>> #FSINTERVAL            86400
>>>>> #FSDECAY               0.80
>>>>>
>>>>> # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>>>>>
>>>>> # NONE SPECIFIED
>>>>>
>>>>> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>>>>>
>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>>
>>>>> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
>>>>>
>>>>> NODEALLOCATIONPOLICY  MINRESOURCE
>>>>>
>>>>> # Kevin's Modifications:
>>>>>
>>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>>
>>>>>
>>>>> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>>>>>
>>>>> # QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
>>>>> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>>>>>
>>>>> # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html
>>>>>
>>>>> # SRSTARTTIME[test] 8:00:00
>>>>> # SRENDTIME[test]   17:00:00
>>>>> # SRDAYS[test]      MON TUE WED THU FRI
>>>>> # SRTASKCOUNT[test] 20
>>>>> # SRMAXTIME[test]   0:30:00
>>>>>
>>>>> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>>>>>
>>>>> # USERCFG[DEFAULT]      FSTARGET=25.0
>>>>> # USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
>>>>> # GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
>>>>> # CLASSCFG[batch]       FLAGS=PREEMPTEE
>>>>> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>>>>>
>>>>> Our MOM config file is:
>>>>>
>>>>> $pbsserver    10.0.0.10    # IP address of titan1.am1.mnet
>>>>> $clienthost    10.0.0.10    # IP address of management node
>>>>> $usecp        *:/home/kevin /home/kevin
>>>>> $usecp        *:/home /home
>>>>> $usecp        *:/root /root
>>>>> $usecp        *:/home/mpi /home/mpi
>>>>> $tmpdir        /home/mpi/tmp
>>>>>
>>>>> I am finding it difficult to identify the configuration issue. I
>>>>> thought this thread would help:
>>>>>
>>>>> http://comments.gmane.org/gmane.comp.clustering.maui.user/2859
>>>>>
>>>>> but their examples show the machine file is working correctly and they
>>>>> are battling memory allocations. I can't seem to get that far yet. Any
>>>>> thoughts?
>>>>>
>>>>> --
>>>>> Kevin Sutherland
>>>>> Simulations Specialist
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ken Nielson
>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>> www.adaptivecomputing.com
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>>
>
>
> --
> Kevin Sutherland
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com