[torqueusers] Using "ncpus" confuses scheduler

Martin Siegert siegert at sfu.ca
Thu Mar 27 10:13:11 MDT 2014


Hi Nick,

On Thu, Mar 27, 2014 at 02:09:03PM +0000, Nick Lindberg wrote:
> 
>    Hello,
> 
>    I am seeing some weird behavior that I think I know the culprit of, but
>    would like a second pair of eyes.  I have a user who has been
>    submitting jobs using
> 
>    #PBS -l ncpus=4
> 
>    What is happening is that this job is getting scheduled on a 16 core
>    node, but thinks that it is taking all 16 processors when really it’s
>    only requesting 4.  There is this weird “Attributes” line in checknode
>    output.  I’ve pasted the output below.  You can see there is one
>    reservation requesting 4 processors, but it thinks dedicated resources
>    is at 16, and it says
> 
>    Attributes:       Processors=4
> 
>    almost like it’s multiplying requested processors by that attribute.  I
>    have no idea where that attribute comes from.  And what is happening is
>    Moab thinks these nodes are full, but really they’re not and my cluster
>    is only running at 60% utilization (which is reported correctly in
>    showq.)
> 
>    What does torque do with ncpus, and is there a way for me to not only
>    discourage but disallow this Torque pragma?  I think "procs=4" or
>    "nodes=1:ppn=4" behaves normally.  Has anybody ever seen this?
> 
>    [root at bright ~]# checknode -v compute-002
>    node compute-002
>    State:      Busy  (in current state for 00:00:23)
>    Configured Resources: PROCS: 16  MEM: 62G  SWAP: 74G  DISK: 1M
>    Utilized   Resources: PROCS: 16  SWAP: 6285M
>    Dedicated  Resources: PROCS: 16
>    Attributes:         Processors=4
>      MTBF(longterm):   INFINITY  MTBF(24h):   INFINITY
>    Opsys:      linux     Arch:      ---
>    Speed:      1.00      CPULoad:   1.000
>    Partition:  torque  Rack/Slot:  ---  NodeIndex:  2
>    IdleTime:   58:18:24:38
>    Classes:    [batch]
>    RM[torque]* TYPE=PBS
>    EffNodeAccessPolicy: SHARED
>    Total Time:    174days  Up:    172days (98.94%)  Active: 93:14:24:12 (53.57%)
>    Reservations:
>      17035x4  Job:Running  -2:17:28:55 -> 7:06:31:05 (10:00:00:00)
>    Jobs:        17035
>    ALERT:  node is in state Busy but load is low (1.000)

ncpus is a relic from ancient times and should not be used.
As far as I know torque does not handle ncpus at all - it just passes
it through to the scheduler, which then may or may not assign resources
to it (usually in a way not expected by the user).
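
If it helps to see which job scripts still request ncpus, a check along
these lines could be run over users' scripts. This is only a sketch: the
function name uses_ncpus and the grep pattern are assumptions made up for
illustration, and it only looks at "#PBS -l" directive lines, not at
options passed on the qsub command line.

```shell
# Hypothetical helper (not part of Torque): exits 0 if the given job
# script contains a "#PBS -l" directive that requests ncpus, 1 otherwise.
uses_ncpus() {
    grep -Eq '^[[:space:]]*#PBS[[:space:]]+-l[[:space:]]*.*ncpus=' "$1"
}
```

For example, `uses_ncpus job.pbs && echo "job.pbs requests ncpus"`, where
job.pbs is a placeholder file name.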

We simply disable the use of ncpus by setting

set server resources_max.ncpus = 0

That way any job submitted with ncpus gets rejected by torque
right away.
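
Applied from the command line, that setting might look like the sketch
below. It assumes you run it on the Torque server host as a user with
manager privileges; the grep is just a quick way to confirm that the
attribute shows up in the server configuration.

```shell
# Set the limit through qmgr (requires Torque manager privileges).
qmgr -c "set server resources_max.ncpus = 0"

# Check that the attribute is now part of the server configuration.
qmgr -c "print server" | grep ncpus
```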

Cheers,
Martin

-- 
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia

