[Mauiusers] Bug? Maui does not respect extended resource violation time if job has been idle in queue for a while.

Nick Sonneveld Nicholas.Sonneveld at utas.edu.au
Thu Apr 12 18:12:49 MDT 2007


Hi guys,

I think I've found a bug in Maui.  Is this the right place to post?

Maui does not wait the full extended violation time if the job has been 
idle in the queue for a while.   If the job starts violating a resource 
restriction immediately when it starts, then it will be killed 
immediately instead of after the violation time.   This does not happen 
if the job has not been waiting in the queue for long.

The problem appears to be that the check against MaxViolationTime doesn't 
take into account the time the job spent in the queue.

To find this out, I inserted a line into maui to print out J->RULVTime 
and P->ResourceLimitMaxViolationTime[VRes]:

04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) < P->ResourceLimitMaxViolationTime[VRes] (300)  ? ,0,0,1)

J->RULVTime was a very large number despite the fact that the job had 
only just started.

Suggested fix: reset J->RULVTime when the job starts?

Using maui-3.2.6p19

--------------------
section of code:

MLimit.c, line 296

       case mrlpExtendedViolation:

         /* determine length of violation */

         if (J->RULVTime < P->ResourceLimitMaxViolationTime[VRes])
           {
           /* ignore violation */

           ResourceLimitsExceeded = FALSE;
           }

         break;
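
To illustrate the suggested fix, here is a minimal sketch. It is not Maui's actual code: SketchJob, sketch_job_start, sketch_update_violation and sketch_limits_exceeded are hypothetical names, and I am assuming the violation time is a running count of seconds that should only grow while a started job is actually over its limit. The comparison mirrors the mrlpExtendedViolation test above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the job field involved; not Maui's real struct. */
typedef struct {
  long rulv_time; /* seconds the job has been continuously in violation */
} SketchJob;

/* Suggested fix: clear any stale violation time when the job starts. */
static void sketch_job_start(SketchJob *j) {
  j->rulv_time = 0;
}

/* Called once per scheduling pass; elapsed is seconds since the last pass.
 * Assumption: the counter resets whenever the job is back within limits. */
static void sketch_update_violation(SketchJob *j, bool violating, long elapsed) {
  assert(elapsed >= 0);
  if (violating)
    j->rulv_time += elapsed;
  else
    j->rulv_time = 0;
}

/* Same comparison as the mrlpExtendedViolation case: the violation is
 * ignored while rulv_time is still below the configured maximum. */
static bool sketch_limits_exceeded(const SketchJob *j, long max_violation_time) {
  return j->rulv_time >= max_violation_time;
}
```

With the values from my log, a stale count of 45426 against a limit of 300 trips the check straight away. After the reset at job start, a job that has been violating for 21 seconds would still be inside the grace period.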

----------------------------------------

config:

whiteout:/var/spool/maui/log # /apps/maui/bin/showconfig  -v | grep RESOURCELIMITPOLICY
RESOURCELIMITPOLICY[0]            PROC:EXTENDEDVIOLATION:CANCEL:00:05:00 MEM:ALWAYS:CANCEL
whiteout:/var/spool/maui/log #



--------------------------------------

whiteout:~ # tracejob -n 2 4660

Job: 4660.whiteout.sf.utas.edu.au

04/12/2007 14:46:59  S    enqueuing into batch, state 1 hop 1
04/12/2007 14:46:59  S    Job Queued at request of
                           USERNAME at whiteout.sf.utas.edu.au, owner =
                           USERNAME at whiteout.sf.utas.edu.au, job name =
                           species3kdFREQgood, queue = batch
04/12/2007 14:46:59  A    queue=batch
04/12/2007 19:56:40  M    Job Modified at request of
                           PBS_Server at whiteout.sf.utas.edu.au
04/12/2007 19:56:40  S    Job Modified at request of
                           maui at whiteout.sf.utas.edu.au
04/12/2007 19:56:40  S    Job Run at request of maui at whiteout.sf.utas.edu.au
04/12/2007 19:56:40  S    Job Modified at request of
                           maui at whiteout.sf.utas.edu.au
04/12/2007 19:56:40  A    user=USERNAME group=users jobname=species3kdFREQgood
                           queue=batch ctime=1176353219 qtime=1176353219
                           etime=1176353219 start=1176371800 exec_host=whiteout
                           Resource_List.mem=2000mb Resource_List.ncpus=1
                           Resource_List.neednodes=whiteout
                           Resource_List.nodect=1 Resource_List.walltime=20:00:00
04/12/2007 19:57:01  S    Job deleted at request of maui at whiteout.sf.utas.edu.au
04/12/2007 19:57:01  S    Job sent signal SIGTERM on delete
04/12/2007 19:57:01  M    kill_task: killing pid 19086 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19169 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19213 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19214 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19227 task 1 with sig 15
04/12/2007 19:57:01  A    requestor=maui at whiteout.sf.utas.edu.au
04/12/2007 19:57:02  S    Exit_status=143 resources_used.cput=00:00:14
                           resources_used.mem=41120kb
                           resources_used.vmem=1816576kb
                           resources_used.walltime=00:00:21
04/12/2007 19:57:02  M    scan_for_terminated: job 4660.whiteout.sf.utas.edu.au
                           task 1 terminated, sid 19086
04/12/2007 19:57:02  M    job was terminated
04/12/2007 19:57:02  A    user=USERNAME group=users jobname=species3kdFREQgood
                           queue=batch ctime=1176353219 qtime=1176353219
                           etime=1176353219 start=1176371800 exec_host=whiteout
                           Resource_List.mem=2000mb Resource_List.ncpus=1
                           Resource_List.neednodes=batch Resource_List.nodect=1
                           Resource_List.walltime=20:00:00 session=19086
                           end=1176371822 Exit_status=143
                           resources_used.cput=00:00:14
                           resources_used.mem=41120kb
                           resources_used.vmem=1816576kb
                           resources_used.walltime=00:00:21
04/12/2007 19:57:08  S    dequeuing from batch, state COMPLETE
whiteout:~ #

-----------------------------------

whiteout:/var/spool/maui/log # grep 4660 maui.log.1
.....
.....
04/12 19:56:33 MJobPReserve(4660,DEFAULT,ResCount,ResCountRej)
04/12 19:56:40 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:56:40 INFO:     26 feasible tasks found for job 4660:0 in partition DEFAULT (1 Needed)
04/12 19:56:40 INFO:     tasks located for job 4660:  1 of 1 required (2 feasible)
04/12 19:56:40 MJobStart(4660)
04/12 19:56:40 MRMJobStart(4660,Msg,SC)
04/12 19:56:40 MPBSJobStart(4660,WHITEOUT.SF.UTAS.EDU.AU,Msg,SC)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,whiteout)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,batch)
04/12 19:56:40 INFO:     job '4660' successfully started
04/12 19:56:40 INFO:     starting job '4660'
04/12 19:56:45 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) < P->ResourceLimitMaxViolationTime[VRes] (300)  ? ,0,0,1)
04/12 19:57:01 MSysRegEvent(JOBRESVIOLATION:  job '4660' in state 'Running' has exceeded PROC resource limit (200 > 100) (action CANCEL will be taken)  job start time: Thu Apr 12 19:56:40
04/12 19:57:01 MRMJobCancel(4660,job violates resource utilization policies,SC)
04/12 19:57:01 MPBSJobCancel(4660,WHITEOUT.SF.UTAS.EDU.AU,CMsg,Msg,job violates resource utilization policies)
04/12 19:57:01 INFO:     job '4660' successfully cancelled
04/12 19:57:04 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:04 INFO:     job '4660' changed states from Running to Completed
04/12 19:57:04 MJobProcessCompleted(4660)
04/12 19:57:04 INFO:     job '4660' completed  X: 0.258361  T: 21  PS: 21  A: 0.000292
04/12 19:57:04 MJobSendFB(4660)
04/12 19:57:04 INFO:     job usage sent for job '4660'
04/12 19:57:04 MJobRemove(4660)
04/12 19:57:04 MJobDestroy(4660)
whiteout:/var/spool/maui/log #

- Nick

-- 
Nick Sonneveld  |  Nicholas.Sonneveld at utas.edu.au
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377  |  0407 336 309  |  Fax (03) 6226 7171

