[torqueusers] Using "ncpus" confuses scheduler

Nick Lindberg nlindberg at mkei.org
Thu Mar 27 10:44:21 MDT 2014


Fantastic.  That's the qmgr command that was eluding me.  Thank you!


Nick Lindberg
Director of Engineering
Milwaukee Institute
414-727-6413 (office)
608-215-3508 (mobile)
nlindberg at mkei.org | www.mkei.org




On 3/27/14, 11:13 AM, "Martin Siegert" <siegert at sfu.ca> wrote:

>Hi Nick,
>
>On Thu, Mar 27, 2014 at 02:09:03PM +0000, Nick Lindberg wrote:
>> 
>>    Hello,
>> 
>>    I am seeing some weird behavior that I think I know the culprit
>>    of, but would like a second pair of eyes.  I have a user who has
>>    been submitting jobs using
>> 
>>    #PBS -l ncpus=4
>> 
>>    What is happening is that this job is getting scheduled on a
>>    16-core node, but thinks that it is taking all 16 processors when
>>    really it's only requesting 4.  There is this weird "Attributes"
>>    line in the checknode output.  I've pasted the output below.  You
>>    can see there is one reservation requesting 4 processors, but it
>>    thinks dedicated resources are at 16, and it says
>> 
>>    Attributes:       Processors=4
>> 
>>    almost like it's multiplying requested processors by that
>>    attribute.  I have no idea where that attribute comes from.  And
>>    what is happening is that Moab thinks these nodes are full, but
>>    really they're not and my cluster is only running at 60%
>>    utilization (which is reported correctly in showq).
>> 
>>    What does torque do with ncpus, and is there a way for me to not
>>    only discourage but disallow this Torque pragma?  I think
>>    "procs=4" or "nodes=1:ppn=4" behaves normally.  Has anybody ever
>>    seen this?
>> 
>>    [root at bright ~]# checknode -v compute-002
>>    node compute-002
>>    State:      Busy  (in current state for 00:00:23)
>>    Configured Resources: PROCS: 16  MEM: 62G  SWAP: 74G  DISK: 1M
>>    Utilized   Resources: PROCS: 16  SWAP: 6285M
>>    Dedicated  Resources: PROCS: 16
>>    Attributes:         Processors=4
>>      MTBF(longterm):   INFINITY  MTBF(24h):   INFINITY
>>    Opsys:      linux     Arch:      ---
>>    Speed:      1.00      CPULoad:   1.000
>>    Partition:  torque  Rack/Slot:  ---  NodeIndex:  2
>>    IdleTime:   58:18:24:38
>>    Classes:    [batch]
>>    RM[torque]* TYPE=PBS
>>    EffNodeAccessPolicy: SHARED
>>    Total Time:    174days  Up:    172days (98.94%)  Active: 93:14:24:12 (53.57%)
>>    Reservations:
>>      17035x4  Job:Running  -2:17:28:55 -> 7:06:31:05 (10:00:00:00)
>>    Jobs:        17035
>>    ALERT:  node is in state Busy but load is low (1.000)
>
>ncpus is a relic from ancient times and should not be used.
>As far as I know torque does not handle ncpus at all - it just passes
>it through to the scheduler, which then may or may not assign resources
>to it (usually in a way not expected by the user).
>
>We simply disable the use of ncpus by setting
>
>set server resources_max.ncpus = 0
>
>That way any job submitted with ncpus gets rejected by torque
>right away.
>
>Cheers,
>Martin
>
>-- 
>Martin Siegert
>WestGrid/ComputeCanada
>Simon Fraser University
>Burnaby, British Columbia
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
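
A minimal sketch of applying the setting Martin describes, assuming a
standard Torque setup where qmgr is run with administrator privileges
on the pbs_server host (the exact qsub rejection message varies by
Torque version):

    # disallow the legacy ncpus request entirely
    qmgr -c "set server resources_max.ncpus = 0"

    # confirm the limit is in place
    qmgr -c "print server" | grep resources_max.ncpus

With that limit set, a script containing "#PBS -l ncpus=4" should be
rejected at submission time, while the usual "-l nodes=1:ppn=4" or
"-l procs=4" requests are unaffected.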


