[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Arnau Bria arnaubria at pic.es
Fri Jun 4 02:45:54 MDT 2010


Hi,

I found out why jobs are not killed when cput/walltime is reached.

# qstat -f 10626859|grep Resource_List
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1


There are no default resource time limits (no Resource_List.cput or
Resource_List.walltime) on the job, so I assume that my queue's
resources_max values are not being taken into consideration:

	resources_max.cput = 01:30:00
	resources_max.walltime = 03:00:00


and that "breaks" what the man page says:

          resources_max
                 The maximum amount of each resource which can be requested by a single job in this queue. The queue value supersedes any server-wide maximum limit. Format: "resources_max.resource_name=value", see qmgr(1B); default value: infinite usage.

          resources_default
                 The list of default resource values which are set as limits for a job residing in this queue and for which the job did not specify a limit. Format: "resources_default.resource_name=value", see qmgr(1B); default value: none; if not set, the default limit for a job is determined by the first of the following attributes which is set: server's resources_default, queue's resources_max, server's resources_max. If none of these are set, the job will get unlimited resource usage.


Because if a user does not request time limits, none end up on the job
at all, which contradicts the documented fallback behaviour.
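To see where the documented chain stops, every level can be dumped
directly (a quick diagnostic sketch; the queue name is just the one
from my jobs above):

# qmgr -c "print server" | grep resources_
# qmgr -c "print queue glong_sl5" | grep resources_

If the man page is right, the resources_max entries that do come back
for the queue should become the job's limits; the qstat output above
shows they never do.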



Resource_List.neednodes = slc5_x64 is not applied either, even though it is defined as a default resource:

# qmgr -c "p q glong_sl5"
set queue glong_sl5 resources_default.neednodes = slc5_x64
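
A quick way to test whether a fresh submission picks that default up at
all (a sketch; -h submits the job held so it never actually runs, and
<jobid> is whatever qsub prints back):

# echo "sleep 60" | qsub -h -q glong_sl5
# qstat -f <jobid> | grep neednodes
# qdel <jobid>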


I changed my queues by adding:

Qmgr: s q long resources_default.cput = 48:00:00
Qmgr: s q long resources_default.walltime = 72:00:00

and now jobs have all of those resources defined (including the default neednodes).
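
For reference, the same change as one-shot commands instead of an
interactive Qmgr session (the values are just the queue defaults I
chose above):

# qmgr -c "set queue long resources_default.cput = 48:00:00"
# qmgr -c "set queue long resources_default.walltime = 72:00:00"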



----------------------
Now, after adding those limits to the queue, I qalter a job by adding a
walltime, and "magically" the other default resources appear:

# qstat -f 10625854
Job Id: 10625854.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = lhpilot001 at ce07.pic.es
    resources_used.cput = 04:35:16
    resources_used.mem = 819224kb
    resources_used.vmem = 2787164kb
    resources_used.walltime = 04:54:34
    job_state = R
    queue = glong_sl5
    server = pbs02.pic.es
    Checkpoint = n
    ctime = Fri Jun  4 07:33:45 2010
    Error_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D2257
	9/batch.err
    exec_host = td163.pic.es/3
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Fri Jun  4 07:36:25 2010
    Output_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D225
	79/batch.out
    Priority = 0
    qtime = Fri Jun  4 07:33:45 2010
    Rerunable = False
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    session_id = 1153
    Shell_Path_List = /bin/sh
    stagein = globus-cache-export.D22579.gpg at ce07.pic.es:/home/lhpilot001/.lcg
	jm/globus-cache-export.D22579/globus-cache-export.D22579.gpg
    substate = 42
    Variable_List = PBS_O_HOME=/home/lhpilot001,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=lhpilot001,
	PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
	glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
	in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
	PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=pbs02.pic.es,PBS_O_WORKDIR=/home/lhpilot001,
	X509_USER_PROXY=/home/lhpilot001/.globus/job/ce07.pic.es/16309.127562
	9464/x509_up,
	GLOBUS_REMOTE_IO_URL=/home/lhpilot001/.lcgjm/.remote_io_ptr/remote_io
	_file-16309.1275629464,GLOBUS_LOCATION=/opt/globus,
	GLOBUS_GRAM_JOB_CONTACT=https://ce07.pic.es:20100/16309/1275629464/,
	GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ce07.pic.es:20101/,
	SCRATCH_DIRECTORY=/home/lhpilot001/,HOME=/home/lhpilot001,
	LOGNAME=lhpilot001,
	EDG_WL_JOBID=https://wms203.cern.ch:9000/vnxhV8Y4YESKwy98UgyERA,
	GLOBUS_CE=ce07.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
	PBS_O_QUEUE=glong_sl5,PBS_O_HOST=ce07.pic.es
    euser = lhpilot001
    egroup = lhpilot
    hashname = 10625854.pbs02.pic.es
    queue_rank = 483277
    queue_type = E
    etime = Fri Jun  4 07:33:45 2010
    start_time = Fri Jun  4 07:36:25 2010
    start_count = 1
    fault_tolerant = False

# qalter -l walltime=87:00:00 10625854
# qstat -f 10625854
Job Id: 10625854.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = lhpilot001 at ce07.pic.es
    resources_used.cput = 04:35:16
    resources_used.mem = 819224kb
    resources_used.vmem = 2787164kb
    resources_used.walltime = 04:54:34
    job_state = R
    queue = glong_sl5
    server = pbs02.pic.es
    Checkpoint = n
    ctime = Fri Jun  4 07:33:45 2010
    Error_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D2257
	9/batch.err
    exec_host = td163.pic.es/3
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Fri Jun  4 10:30:17 2010
    Output_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D225
	79/batch.out
    Priority = 0
    qtime = Fri Jun  4 07:33:45 2010
    Rerunable = False
    Resource_List.cput = 80:00:00
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 87:00:00
    session_id = 1153
    Shell_Path_List = /bin/sh
    stagein = globus-cache-export.D22579.gpg at ce07.pic.es:/home/lhpilot001/.lcg
	jm/globus-cache-export.D22579/globus-cache-export.D22579.gpg
    substate = 42
    Variable_List = PBS_O_HOME=/home/lhpilot001,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=lhpilot001,
	PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
	glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
	in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
	PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=pbs02.pic.es,PBS_O_WORKDIR=/home/lhpilot001,
	X509_USER_PROXY=/home/lhpilot001/.globus/job/ce07.pic.es/16309.127562
	9464/x509_up,
	GLOBUS_REMOTE_IO_URL=/home/lhpilot001/.lcgjm/.remote_io_ptr/remote_io
	_file-16309.1275629464,GLOBUS_LOCATION=/opt/globus,
	GLOBUS_GRAM_JOB_CONTACT=https://ce07.pic.es:20100/16309/1275629464/,
	GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ce07.pic.es:20101/,
	SCRATCH_DIRECTORY=/home/lhpilot001/,HOME=/home/lhpilot001,
	LOGNAME=lhpilot001,
	EDG_WL_JOBID=https://wms203.cern.ch:9000/vnxhV8Y4YESKwy98UgyERA,
	GLOBUS_CE=ce07.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
	PBS_O_QUEUE=glong_sl5,PBS_O_HOST=ce07.pic.es
    euser = lhpilot001
    egroup = lhpilot
    hashname = 10625854.pbs02.pic.es
    queue_rank = 483277
    queue_type = E
    etime = Fri Jun  4 07:33:45 2010
    start_time = Fri Jun  4 07:36:25 2010
    Walltime.Remaining = 30276
    start_count = 1
    fault_tolerant = False
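
Since qalter'ing a single resource made the server fill in the queue
defaults, one possible workaround for jobs already in the system is to
re-assert a limit on each of them (a rough, untested sketch; as written
it would also clobber any user-requested walltime, so you would want to
filter out jobs that already have Resource_List.walltime set):

# for j in $(qstat | awk '/ R /{print $1}'); do qalter -l walltime=72:00:00 $j; done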


Another thing is the strange character at the end of the
Walltime.Remaining line; it's not a mail typo, it's TORQUE's output.
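
To see exactly which byte TORQUE emits there, the line can be piped
through something that makes non-printing characters visible (a sketch;
cat -A is the GNU coreutils option that marks line ends and control
characters):

# qstat -f 10625854 | grep Walltime.Remaining | cat -A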



This seems like a big bug to me; maybe a developer could give an opinion.

TIA,
Arnau

