[Mauiusers] Does Maui respect EXTENDEDVIOLATION resource limits?
Nick Sonneveld
Nicholas.Sonneveld at utas.edu.au
Sun Apr 1 20:20:09 MDT 2007
Hi guys,
I think I found the problem. I changed the setting with 'changeparam'
when I really should have restarted the maui process. After I restarted
the scheduler, it no longer seems to kill a job until the time limit has
passed.
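
As a sanity check on the log quoted below, here is a minimal Python sketch
of the timing arithmetic. The helper names (parse_hms,
runtime_before_cancel) are made up for illustration and are not part of
Maui or TORQUE. It measures from the "Job Run" entry to the "Job deleted"
entry in the tracejob output, which only gives an upper bound on how long
the PROC violation could have lasted. Even so, the job ran for 53 seconds
in total, far short of the 00:15:00 configured for EXTENDEDVIOLATION.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the tracejob lines quoted in this thread.
TRACEJOB_FMT = "%m/%d/%Y %H:%M:%S"


def parse_hms(text):
    """Convert an HH:MM:SS string such as '00:15:00' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def runtime_before_cancel(run_stamp, delete_stamp):
    """Return how long the job ran between 'Job Run' and 'Job deleted'."""
    started = datetime.strptime(run_stamp, TRACEJOB_FMT)
    deleted = datetime.strptime(delete_stamp, TRACEJOB_FMT)
    return deleted - started


# Values copied from the showconfig and tracejob output in the quoted mail.
grace = parse_hms("00:15:00")
elapsed = runtime_before_cancel("03/07/2007 11:35:32", "03/07/2007 11:36:25")

print(f"ran for {int(elapsed.total_seconds())}s, "
      f"grace window {int(grace.total_seconds())}s")
print("cancelled inside the grace window:", elapsed < grace)
```

If the last line prints True for a job after the restart, the grace period
is presumably still not being honoured.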
- Nick
Nick Sonneveld wrote:
> Hullo,
>
> I'm running maui 3.2.6p19-snap.1169758944 and I'm having trouble trying
> to get it to allow resource overruns for a short time.
>
> Current settings:
>
> whiteout:/var/spool/maui # /apps/maui/bin/showconfig -v | grep RESOURCELIMITPOLICY
> RESOURCELIMITPOLICY[0] PROC:EXTENDEDVIOLATION:CANCEL:00:15:00 MEM:ALWAYS:CANCEL
> whiteout:/var/spool/maui #
>
>
> However, looking at the logs today, I saw:
>
> whiteout:/var/spool/maui/log # grep -i 'violation' maui.log
> 03/07 11:36:25 MSysRegEvent(JOBRESVIOLATION: job '3648' in state 'Running' has exceeded PROC resource limit (141 > 100) (action CANCEL will be taken) job start time: Wed Mar 7 11:35:32
> 03/07 11:36:25 ALERT: limit violation action CANCEL succeeded
>
> and
>
> whiteout:/var/spool/maui/log # tracejob 3648
>
> Job: 3648.whiteout.sf.utas.edu.au
>
> 03/07/2007 00:40:01 S enqueuing into batch, state 1 hop 1
> 03/07/2007 00:40:01 S Job Queued at request of
> prachab at whiteout.sf.utas.edu.au, owner =
> prachab at whiteout.sf.utas.edu.au, job name =
> Test2_4C,
> queue = batch
> 03/07/2007 00:40:01 A queue=batch
> 03/07/2007 11:35:32 S Job Modified at request of
> maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:32 S Job Run at request of
> maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33 M Job Modified at request of
> PBS_Server at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33 S Job Modified at request of
> maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33 A user=prachab group=users jobname=Test2_4C
> queue=batch
> ctime=1173188401 qtime=1173188401
> etime=1173188401
> start=1173227733 exec_host=whiteout
> Resource_List.mem=2000mb Resource_List.ncpus=1
> Resource_List.neednodes=whiteout
> Resource_List.nodect=1
> Resource_List.walltime=1000:00:00
> 03/07/2007 11:36:25 S Job deleted at request of
> maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:36:25 S Job sent signal SIGTERM on delete
> 03/07/2007 11:36:25 M kill_task: killing pid 32547 task 1 with sig 15
> 03/07/2007 11:36:25 M kill_task: killing pid 32569 task 1 with sig 15
> 03/07/2007 11:36:25 M kill_task: killing pid 32574 task 1 with sig 15
> 03/07/2007 11:36:25 M kill_task: killing pid 32615 task 1 with sig 15
> 03/07/2007 11:36:25 A requestor=maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:36:28 S Exit_status=143 resources_used.cput=00:00:46
> resources_used.mem=300784kb
> resources_used.vmem=341792kb
> resources_used.walltime=00:00:52
> 03/07/2007 11:36:28 M kill_task: killing pid 32615 task 1 with sig 9
> 03/07/2007 11:36:28 M scan_for_terminated: job
> 3648.whiteout.sf.utas.edu.au
> task 1 terminated, sid 32547
> 03/07/2007 11:36:28 M job was terminated
> 03/07/2007 11:36:28 A user=prachab group=users jobname=Test2_4C
> queue=batch
> ctime=1173188401 qtime=1173188401
> etime=1173188401
> start=1173227733 exec_host=whiteout
> Resource_List.mem=2000mb Resource_List.ncpus=1
> Resource_List.neednodes=batch
> Resource_List.nodect=1
> Resource_List.walltime=1000:00:00 session=32547
> end=1173227788 Exit_status=143
> resources_used.cput=00:00:46
> resources_used.mem=300784kb
> resources_used.vmem=341792kb
> resources_used.walltime=00:00:52
> 03/07/2007 11:36:37 S dequeuing from batch, state COMPLETE
>
>
> It looks like Maui didn't wait a full 15 minutes before killing the job.
> Is there something wrong with my config?
>
> - Nick
>
--
Nick Sonneveld | Nicholas.Sonneveld at utas.edu.au
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377 | 0407 336 309 | Fax (03) 6226 7171