[Mauiusers] running jobs & restarting maui

Thomas Dargel td at chemie.hu-berlin.de
Tue Nov 8 02:20:16 MST 2005


Hi Chris, 

Thanks for answering my mail.

On Tue, Nov 08, 2005 at 09:59:50AM +1100, Chris Samuel wrote:
> On Tue, 8 Nov 2005 01:40 am, Thomas Dargel wrote:
> 
> > sorry when I miss something in the docs, but is it normal that a
> > restart of maui kills all running jobs???
> 
> No!

That's good.
> 
> > How can I keep the jobs running in spite of restarting maui?
> 
> What does Maui log when it kills them ?
> 
After a closer look at the log, I found this:

11/08 09:20:41 ALERT:    job '561' in state 'Running' has exceeded its wallclock limit (0+S:0) by 16:43:00 (job will be cancelled)
11/08 09:20:41 MSysRegEvent(JOBWCVIOLATION:  job '561' in state 'Running' has exceeded its wallclock limit (0) by 16:43:00 (job will be cancelled)  job start time: Mon Nov  7 16:37:41 ,0,0,1)
11/08 09:20:41 MSysLaunchAction(ASList,1)
11/08 09:20:41 MRMJobCancel(561,MOAB_INFO:  job exceeded wallclock limit ,SC)
11/08 09:20:41 MPBSJobCancel(561,node01,CMsg,Msg,MOAB_INFO:  job exceeded wallclock limit)
11/08 09:20:41 INFO:     job '561' successfully cancelled
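
For what it's worth, the figures in the ALERT line can be checked by hand. The following is only a minimal Python sketch of that arithmetic: the year 2005 is an assumption taken from the date of this mail (the log lines omit it), and the variable names are made up for illustration. It suggests that the reported overrun equals the whole run time of the job, i.e. the limit Maui applied was zero.

```python
from datetime import datetime, timedelta

# Values copied from the log excerpt above. The year (2005) is an assumption
# taken from the date of this mail, because the log lines do not include it.
alert_time = datetime(2005, 11, 8, 9, 20, 41)       # "11/08 09:20:41"
start_time = datetime(2005, 11, 7, 16, 37, 41)      # "job start time: Mon Nov  7 16:37:41"
reported_overrun = timedelta(hours=16, minutes=43)  # "by 16:43:00"

# Time the job had been running when the alert was logged.
elapsed = alert_time - start_time

# Whatever is left after subtracting the overrun is the limit that was applied.
implied_limit = elapsed - reported_overrun

print("elapsed:", elapsed)
print("implied wallclock limit:", implied_limit)
```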

Do I have to set a 'wallclock limit' in maui.cfg or when the job is submitted?

> Thinking back - are your users setting walltimes for their jobs ?
> If not - what is the default walltime you are assigning ?
>

There is no resources_default.walltime setting on the server. With the
torque scheduler, leaving it unset means the walltime is treated as
infinite, and that is what I need with the maui scheduler as well.

> What do the output of checkjob and qstat -f look like for a sample job on your 
> system ?
> 

qstat -f 561
Job Id: 561.cnode01.mauicluster
    Job_Name = job.dual
    Job_Owner = td@cnode01.mauicluster
    resources_used.cput = 16:39:49
    resources_used.mem = 517212kb
    resources_used.vmem = 597276kb
    resources_used.walltime = 16:40:02
    job_state = R
    queue = cpu-2
    server = cnode01.mauicluster
    Checkpoint = u
    ctime = Mon Nov  7 16:37:39 2005
    Error_Path = cnode01.mauicluster:/huge/td/cpmd/job.dual.e561
    exec_host = cnode01/1+cnode01/0
    Hold_Types = n
    Join_Path = eo
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Nov  7 16:37:41 2005
    Output_Path = cnode01.mauicluster:/huge/td/cpmd/job.dual.o561
    Priority = 0
    qtime = Mon Nov  7 16:37:39 2005
    Rerunable = False
    Resource_List.mem = 8191mb
    Resource_List.neednodes = 1:ppn=2
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    session_id = 705
    substate = 42
    Variable_List = PBS_O_HOME=/users/td,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=td,
        PBS_O_PATH=/sysinst/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/u
        sr/games:/opt/gnome/bin:/opt/kde3/bin:/apps/maui/bin:/apps/torque/bin,
        PBS_O_MAIL=/var/mail/td,PBS_O_SHELL=/bin/ksh,
        PBS_O_HOST=cnode01.mauicluster,PBS_O_WORKDIR=/huge/td/cpmd,
        PBS_O_QUEUE=mixpipe
    euser = td
    egroup = qc
    hashname = 561.cnode01
    queue_rank = 611
    queue_type = E
    etime = Mon Nov  7 16:37:39 2005
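
Note that the output above has no Resource_List.walltime entry. As a rough illustration, here is a small Python sketch that checks for this in saved `qstat -f` text. The function name `parse_qstat_f` and the shortened sample are hypothetical, and the parser assumes that attribute lines are indented by at most four spaces and contain " = ", with anything else being a wrapped continuation of the previous value.

```python
def parse_qstat_f(text):
    """Parse the text of `qstat -f` for one job into a dict of attributes."""
    attrs = {}
    key = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("Job Id:"):
            continue
        indent = len(line) - len(line.lstrip())
        if " = " in line and indent <= 4:
            # A new attribute line, e.g. "    job_state = R".
            key, value = stripped.split(" = ", 1)
            attrs[key] = value
        elif key is not None:
            # A wrapped continuation line: append it to the previous value.
            attrs[key] += stripped
    return attrs


# Shortened excerpt of the output above, used only as sample input.
sample = """Job Id: 561.cnode01.mauicluster
    Job_Name = job.dual
    job_state = R
    Resource_List.mem = 8191mb
    Resource_List.nodes = 1:ppn=2
    Variable_List = PBS_O_HOME=/users/td,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=td,
"""

attrs = parse_qstat_f(sample)
if "Resource_List.walltime" in attrs:
    print("walltime requested:", attrs["Resource_List.walltime"])
else:
    print("no walltime requested for this job")
```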


checkjob -v 561

checking job 561 (RM job '561.cnode01.mauicluster')

State: Running
Creds:  user:td  group:qc  class:cpu-2  qos:DEFAULT
WallTime: 16:41:29 of 99:23:59:59
SubmitTime: Mon Nov  7 16:37:39
  (Time Queued  Total: 00:00:02  Eligible: 00:00:02)

StartTime: Mon Nov  7 16:37:41
Total Tasks: 2

Req[0]  TaskCount: 2  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [dual]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 4095M
Utilized Resources Per Task:  PROCS: 0.49  MEM: 2.52  SWAP: 5.83
Avg Util Resources Per Task:  PROCS: 0.49
Max Util Resources Per Task:  PROCS: 0.49  MEM: 2.52  SWAP: 5.83
Average Utilized Memory: 255.79 MB
Average Utilized Procs: 0.98
NodeAccess: SHARED
TasksPerNode: 2  NodeCount: 1
Allocated Nodes:
[cnode01:2]
Task Distribution: cnode01,cnode01


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Reservation '561' (-16:41:39 -> 99:07:18:20  Duration: 99:23:59:59)
PE:  2.00  StartPriority:  1
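
Here checkjob shows a limit of 99:23:59:59 for the running job, while the log excerpt earlier reports a wallclock limit of 0. To compare such values, a minimal Python sketch for converting them to seconds might look like this. It assumes the colon-separated fields are days, hours, minutes and seconds, read from right to left; the helper name `duration_to_seconds` is hypothetical.

```python
def duration_to_seconds(spec):
    """Convert a '[[DD:]HH:]MM:SS' style duration string to seconds.

    Assumption: fields are read from the right as seconds, minutes,
    hours and days.
    """
    units = (1, 60, 3600, 86400)  # seconds, minutes, hours, days
    fields = [int(f) for f in spec.split(":")]
    if len(fields) > len(units):
        raise ValueError("too many fields in duration: %r" % spec)
    return sum(value * unit for value, unit in zip(reversed(fields), units))


limit = duration_to_seconds("99:23:59:59")  # limit shown by checkjob
used = duration_to_seconds("16:41:29")      # walltime used so far
print("limit in seconds:", limit)
print("used in seconds:", used)
print("remaining in seconds:", limit - used)
```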


When I searched the maui.log, I found the following error message:

ERROR:    job '571' has NULL WCLimit field

Changing XFMINWCLIMIT from "00:02:00" to "-1" made no difference.
Any hints on what I need to do?

Thank you in advance,

Thomas.

    > Chris
    > -- 
    >  Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
    >  Victorian Partnership for Advanced Computing http://www.vpac.org/
    >  Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
    > 


