[Mauiusers] RESOURCELIMITPOLICY PROC seems to work task-wise rather
than across all tasks?
Lech Nieroda
lnieroda at gmail.com
Wed Mar 11 03:50:01 MDT 2009
Dear list,
I'm trying to set up a limit on the number of processors used, so that
a job which uses more cores than it requested at submission time is
cancelled, preferably after some grace period has passed.
According to the manual the right config would be
RESOURCELIMITPOLICY PROC:EXTENDEDVIOLATION:CANCEL:00:05:00
which monitors the actual load and should cancel a job if a violation
lasts longer than 5 minutes.
The problem: it kills any job whose load exceeds 1, even if the job
requested several cores at submit time (it also doesn't wait 5 minutes
before doing so, but that's another issue).
For example, let's say I submit a job with -l nodes=1:ppn=4,mem=2000m
which uses 4 cores.
It's soon killed with the following comment in the logs:
job '41975' in state 'Running' has exceeded PROC resource limit (394 > 100) (action CANCEL will be taken)
The command 'diagnose -j' says:
Name  State   Par Proc QOS WCLimit     R Min User Group   Account QueuedTime Network Opsys  Arch   Mem Disk Procs Class       Features
41975 Running DEF 4    DEF 99:23:59:59 1 4   user uniuser -       00:01:35   [NONE]  [NONE] [NONE] >=0 >=0  NC0   [default:1] [NONE]
WARNING: job '41975' utilizes more procs than dedicated (3.94 > 1)
Note that 'Proc' is '4', as it should be; however, Maui claims that only
one processor is dedicated.
'checkjob -v 41975' says:
...
Req[0] TaskCount: 4 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 500M
Utilized Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
Avg Util Resources Per Task: PROCS: 3.94
Max Util Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
Average Utilized Memory: 664.63 MB
Average Utilized Procs: 10.48
NodeAccess: SHARED
TasksPerNode: 4 NodeCount: 1
Allocated Nodes:
...
Reservation '41975' (-00:01:33 -> 99:23:58:26 Duration: 99:23:59:59)
PE: 4.00 StartPriority: 19821
What seems to be happening here is that the requested resources (4
cores, 2000m memory) are divided equally among 4 tasks with 1 core and
500m memory each. The four processes that generate the load of 3.94 are,
for some reason, attributed to a single task rather than spread across
all 4, and this 3.94 > 1 'violation' triggers the cancellation of the job.
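To spell out the arithmetic as I understand it, here is a small Python sketch. The numbers come from the checkjob output above; the two comparison functions are only my guess at what is being checked versus what I would expect, and are not taken from the Maui source.

```python
# Values taken from the 'checkjob -v 41975' output above.
task_count = 4             # Req[0] TaskCount
dedicated_per_task = 1.0   # Dedicated Resources Per Task: PROCS
utilized_reported = 3.94   # Utilized Resources Per Task: PROCS


def violates_per_task(utilized, dedicated):
    """Assumed current behaviour: the whole job's load is compared
    against the processors dedicated to a single task."""
    return utilized > dedicated


def violates_job_wide(utilized, dedicated, tasks):
    """Expected behaviour (hypothetical): the job's load is compared
    against the processors dedicated to all of its tasks together."""
    return utilized > dedicated * tasks


# Per-task check: 3.94 > 1, so the job is flagged.
print(violates_per_task(utilized_reported, dedicated_per_task))

# Job-wide check: 3.94 > 4 does not hold, so the job would be left alone.
print(violates_job_wide(utilized_reported, dedicated_per_task, task_count))
```

With the per-task comparison the job counts as a violator, which matches the "394 > 100" log line; with the job-wide comparison it would stay within its 4 requested cores.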
Any idea how to make this work? Is there a way to apply the limit
across all tasks of a job rather than to just one?
We are using maui-3.2.6p19.
Regards,
Lech