[Mauiusers] Does Maui respect EXTENDEDVIOLATION resource limits?

Nick Sonneveld Nicholas.Sonneveld at utas.edu.au
Sun Apr 1 20:20:09 MDT 2007


Hi guys,

I think I found the problem.  I had set the policy with 'changeparam' when
I really should have restarted the maui process.  After I restarted the
scheduler, it no longer seemed to kill a job until the time limit had passed.

- Nick


Nick Sonneveld wrote:
> Hullo,
> 
> I'm running maui 3.2.6p19-snap.1169758944 and I'm having trouble trying 
> to get it to allow resource overruns for a short time.
> 
> Current settings:
> 
> whiteout:/var/spool/maui # /apps/maui/bin/showconfig -v | grep RESOURCELIMITPOLICY
> RESOURCELIMITPOLICY[0]            PROC:EXTENDEDVIOLATION:CANCEL:00:15:00 MEM:ALWAYS:CANCEL
> whiteout:/var/spool/maui #
> 
> 
> However, looking at the logs today, I saw:
> 
> whiteout:/var/spool/maui/log # grep -i 'violation' maui.log
> 03/07 11:36:25 MSysRegEvent(JOBRESVIOLATION:  job '3648' in state 'Running' has exceeded PROC resource limit (141 > 100) (action CANCEL will be taken)  job start time: Wed Mar  7 11:35:32
> 03/07 11:36:25 ALERT:    limit violation action CANCEL succeeded
> 
> and
> 
> whiteout:/var/spool/maui/log # tracejob 3648
> 
> Job: 3648.whiteout.sf.utas.edu.au
> 
> 03/07/2007 00:40:01  S    enqueuing into batch, state 1 hop 1
> 03/07/2007 00:40:01  S    Job Queued at request of
>                           prachab at whiteout.sf.utas.edu.au, owner =
>                           prachab at whiteout.sf.utas.edu.au, job name = Test2_4C,
>                           queue = batch
> 03/07/2007 00:40:01  A    queue=batch
> 03/07/2007 11:35:32  S    Job Modified at request of
>                           maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:32  S    Job Run at request of maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33  M    Job Modified at request of
>                           PBS_Server at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33  S    Job Modified at request of
>                           maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:35:33  A    user=prachab group=users jobname=Test2_4C queue=batch
>                           ctime=1173188401 qtime=1173188401 etime=1173188401
>                           start=1173227733 exec_host=whiteout
>                           Resource_List.mem=2000mb Resource_List.ncpus=1
>                           Resource_List.neednodes=whiteout
>                           Resource_List.nodect=1
>                           Resource_List.walltime=1000:00:00
> 03/07/2007 11:36:25  S    Job deleted at request of maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:36:25  S    Job sent signal SIGTERM on delete
> 03/07/2007 11:36:25  M    kill_task: killing pid 32547 task 1 with sig 15
> 03/07/2007 11:36:25  M    kill_task: killing pid 32569 task 1 with sig 15
> 03/07/2007 11:36:25  M    kill_task: killing pid 32574 task 1 with sig 15
> 03/07/2007 11:36:25  M    kill_task: killing pid 32615 task 1 with sig 15
> 03/07/2007 11:36:25  A    requestor=maui at whiteout.sf.utas.edu.au
> 03/07/2007 11:36:28  S    Exit_status=143 resources_used.cput=00:00:46
>                           resources_used.mem=300784kb
>                           resources_used.vmem=341792kb
>                           resources_used.walltime=00:00:52
> 03/07/2007 11:36:28  M    kill_task: killing pid 32615 task 1 with sig 9
> 03/07/2007 11:36:28  M    scan_for_terminated: job 3648.whiteout.sf.utas.edu.au
>                           task 1 terminated, sid 32547
> 03/07/2007 11:36:28  M    job was terminated
> 03/07/2007 11:36:28  A    user=prachab group=users jobname=Test2_4C queue=batch
>                           ctime=1173188401 qtime=1173188401 etime=1173188401
>                           start=1173227733 exec_host=whiteout
>                           Resource_List.mem=2000mb Resource_List.ncpus=1
>                           Resource_List.neednodes=batch Resource_List.nodect=1
>                           Resource_List.walltime=1000:00:00 session=32547
>                           end=1173227788 Exit_status=143
>                           resources_used.cput=00:00:46
>                           resources_used.mem=300784kb
>                           resources_used.vmem=341792kb
>                           resources_used.walltime=00:00:52
> 03/07/2007 11:36:37  S    dequeuing from batch, state COMPLETE
> 
> 
> It looks like Maui didn't wait a full 15 minutes before killing the job.
> Is there something wrong with my config?
> 
> - Nick
> 
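
For anyone checking the same thing on their own logs, here is a rough
sketch of the arithmetic behind "didn't wait a full 15 minutes".  The
timestamps and the 00:15:00 grace period are copied from the output quoted
above; the helper name parse_hms and the variable names are made up for
illustration and are not part of Maui or TORQUE.  The year is taken from
the tracejob output, since maui.log does not print one.

```python
from datetime import datetime, timedelta


def parse_hms(text):
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Timestamps copied from the maui.log excerpt (year assumed from tracejob).
job_start = datetime(2007, 3, 7, 11, 35, 32)  # "job start time: Wed Mar  7 11:35:32"
cancelled = datetime(2007, 3, 7, 11, 36, 25)  # "limit violation action CANCEL succeeded"

# Grace period from PROC:EXTENDEDVIOLATION:CANCEL:00:15:00
grace = parse_hms("00:15:00")

# A job cannot have been over its limit for longer than it has been running,
# so the elapsed run time is an upper bound on the time spent in violation.
elapsed = cancelled - job_start

print("ran for", elapsed.total_seconds(), "s before CANCEL")
print("grace period is", grace.total_seconds(), "s")
print("cancelled before grace period expired:", elapsed < grace)
```

The job had been running for only 53 seconds when it was cancelled, against
a 900-second grace period, so the cancel cannot have come from a full
15-minute extended violation.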

-- 
Nick Sonneveld  |  Nicholas.Sonneveld at utas.edu.au
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377  |  0407 336 309  |  Fax (03) 6226 7171

