[Mauiusers] RESOURCELIMITPOLICY PROC seems to work task-wise rather than across all tasks?

Lech Nieroda lnieroda at gmail.com
Wed Mar 11 03:50:01 MDT 2009


Dear list,

I'm trying to limit the number of processors a job may use, so that
a job which uses more cores than it requested at submit time is
cancelled, preferably after a grace period.
According to the manual the right config would be

  RESOURCELIMITPOLICY PROC:EXTENDEDVIOLATION:CANCEL:00:05:00

which monitors the actual load and should cancel a job if a violation
lasts longer than 5 minutes.
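
To spell out the behaviour I expect from that line, here is a minimal
Python sketch of my reading of EXTENDEDVIOLATION: cancel only once the
limit has been exceeded continuously for the whole grace period. The
function name and the sampling model are my own illustration, not
Maui's code.

```python
from datetime import timedelta

GRACE = timedelta(minutes=5)  # the 00:05:00 field of the policy line


def should_cancel(samples, limit, grace=GRACE):
    """Hypothetical helper, not part of Maui.

    samples: list of (elapsed: timedelta, used_procs: float) in time order.
    Returns True once usage has stayed above `limit` continuously for at
    least `grace`.
    """
    violation_start = None
    for elapsed, used in samples:
        if used > limit:
            if violation_start is None:
                violation_start = elapsed  # violation begins here
            if elapsed - violation_start >= grace:
                return True
        else:
            violation_start = None  # usage dropped back, reset the clock
    return False
```

Under this reading, a job that has been over its limit for only a
minute or two should still be running.
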
The problem: it kills any job whose load exceeds 1, even if the job
requested several cores at submit time (and it doesn't wait 5 minutes
to do so, but that's another issue).

For example, let's say I submit a job with -l nodes=1:ppn=4,mem=2000m
which uses 4 cores.
It's soon killed with the following comment in the logs:
job '41975' in state 'Running' has exceeded PROC resource limit (394 > 100) (action CANCEL will be taken)

The command 'diagnose -j' says:
  Name   State    Par  Proc  QOS  WCLimit      R  Min  User  Group    Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class        Features

  41975  Running  DEF  4     DEF  99:23:59:59  1  4    user  uniuser  -        00:01:35    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [default:1]  [NONE]
  WARNING:  job '41975' utilizes more procs than dedicated (3.94 > 1)

Note that 'Proc' is '4' as it should be, however maui claims that only
one processor is dedicated.

'checkjob -v 41975' says:
...
  Req[0]  TaskCount: 4  Partition: DEFAULT
  Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
  Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
  Exec:  ''  ExecSize: 0  ImageSize: 0
  Dedicated Resources Per Task: PROCS: 1  MEM: 500M
  Utilized Resources Per Task:  PROCS: 3.94  MEM: 1.15  SWAP: 5.87
  Avg Util Resources Per Task:  PROCS: 3.94
  Max Util Resources Per Task:  PROCS: 3.94  MEM: 1.15  SWAP: 5.87
  Average Utilized Memory: 664.63 MB
  Average Utilized Procs: 10.48
  NodeAccess: SHARED
  TasksPerNode: 4  NodeCount: 1
  Allocated Nodes:
...
  Reservation '41975' (-00:01:33 -> 99:23:58:26  Duration: 99:23:59:59)
  PE:  4.00  StartPriority:  19821

What seems to be happening is that the requested resources (4 cores,
2000m mem) are divided equally into 4 tasks with 1 core and 500m mem
each. The four processes that generate the load of 3.94 are, for some
reason, attributed to a single task rather than spread over all 4, and
this 3.94 > 1 'violation' triggers the cancellation of the job.
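
To make the difference concrete, here is a small Python sketch of the
two comparisons as I understand them, using the numbers from the
checkjob output. The function names are hypothetical and this is only
my guess at the logic, not what Maui actually implements.

```python
def per_task_violation(utilized, dedicated_per_task):
    # What Maui appears to do: compare the whole job's load
    # against the resources dedicated to a single task.
    return utilized > dedicated_per_task


def job_wide_violation(utilized, dedicated_per_task, task_count):
    # What I would expect: compare the load against the sum of
    # the resources dedicated to all tasks of the job.
    return utilized > dedicated_per_task * task_count


task_count = 4          # TaskCount: 4
dedicated_per_task = 1  # Dedicated Resources Per Task: PROCS: 1
utilized = 3.94         # Utilized Resources Per Task: PROCS: 3.94

print(per_task_violation(utilized, dedicated_per_task))              # True
print(job_wide_violation(utilized, dedicated_per_task, task_count))  # False
```

With the job-wide comparison, the job stays within its 4 requested
cores and would not be cancelled.
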

Any idea how to make this work? Is there a way to apply the limit to
the job as a whole (all tasks combined) rather than to a single task?
We are using maui-3.2.6p19.

Regards,
Lech

