[Mauiusers] RESOURCELIMITPOLICY PROC seems to work task-wise rather
than across all tasks?
lnieroda at gmail.com
Wed Mar 11 03:50:01 MDT 2009
I'm trying to set up a limit on the number of processors a job uses, so
that a job which uses more cores than it requested at submit time is
cancelled, preferably after a grace period has passed.
According to the manual the right config would be
which monitors the actual load and should cancel a job if a violation
lasts longer than 5 minutes.
The problem: it kills any job whose load exceeds 1, even if the job
requested several cores at submit time (it also doesn't wait 5 minutes
before doing so, but that's another issue).
For example, let's say I submit a job with -l nodes=1:ppn=4,mem=2000m
which uses 4 cores.
It's soon killed with the following comment in the logs:
job '41975' in state 'Running' has exceeded PROC resource limit (394 >
100) (action CANCEL will be taken)
The command 'diagnose -j' says:
Name State Par Proc QOS WCLimit R Min User Group
Account QueuedTime Network Opsys Arch Mem Disk Procs
41975 Running DEF 4 DEF 99:23:59:59 1 4 user uniuser
- 00:01:35 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1]
WARNING: job '41975' utilizes more procs than dedicated (3.94 > 1)
Note that 'Proc' is '4' as it should be, however maui claims that only
one processor is dedicated.
'checkjob -v 41975' says:
Req TaskCount: 4 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 500M
Utilized Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
Avg Util Resources Per Task: PROCS: 3.94
Max Util Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
Average Utilized Memory: 664.63 MB
Average Utilized Procs: 10.48
TasksPerNode: 4 NodeCount: 1
Reservation '41975' (-00:01:33 -> 99:23:58:26 Duration: 99:23:59:59)
PE: 4.00 StartPriority: 19821
What seems to be happening here is that the requested resources (4
cores, 2000m mem) are divided equally among 4 tasks with 1 core and 500m
mem each. The four processes which generate the load of 3.94 are, for
some reason, attributed to a single task rather than spread across all
4, and this 3.94 > 1 'violation' triggers the cancellation of the job.
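To make the suspected difference concrete, here is a minimal sketch in Python. It is not Maui code and I have not checked the Maui source; the function names are made up for illustration, and the numbers are the ones from the checkjob output above. It only contrasts comparing the job's utilization against one task's dedicated procs with comparing it against the dedicated procs of all tasks.

```python
# Illustrative only: not Maui source code; all names here are hypothetical.

def per_task_violation(utilized_procs, dedicated_per_task):
    """Compare the job's utilization against ONE task's dedicated procs."""
    return utilized_procs > dedicated_per_task


def job_wide_violation(utilized_procs, dedicated_per_task, task_count):
    """Compare the job's utilization against all tasks' dedicated procs."""
    return utilized_procs > dedicated_per_task * task_count


# Values taken from the checkjob output for job 41975.
utilized = 3.94   # Utilized Resources Per Task: PROCS
dedicated = 1     # Dedicated Resources Per Task: PROCS
tasks = 4         # Req TaskCount

print(per_task_violation(utilized, dedicated))         # what I seem to observe
print(job_wide_violation(utilized, dedicated, tasks))  # what I would expect
```

With these numbers the first check reports a violation (3.94 > 1) while the second does not (3.94 < 4), which matches the behaviour I would like to get.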
Any idea how to make this work? Is there a way to apply the limit
across all tasks rather than just one?
We are using maui-3.2.6p19.