[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
Arnau Bria
arnaubria at pic.es
Fri Jun 4 02:45:54 MDT 2010
Hi,
I found why jobs are not killed when cput/wall_time is reached.
# qstat -f 10626859|grep Resource_List
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
There are no default time limits in the job's resource list: neither
Resource_List.cput nor Resource_List.walltime is set.
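To check other jobs for the same symptom, the `qstat -f` text can be scanned for the two time limits. This is only a minimal Python sketch: the function name `missing_time_limits` is made up for illustration, and it assumes each attribute appears on its own line as `name = value`, as in the output above.

```python
def missing_time_limits(qstat_output):
    """Return which of cput/walltime are absent from one job's `qstat -f` text."""
    prefix = "Resource_List."
    present = set()
    for line in qstat_output.splitlines():
        name, sep, _value = line.partition(" = ")
        name = name.strip()
        # Only lines of the form "Resource_List.<resource> = <value>" count.
        if sep and name.startswith(prefix):
            present.add(name[len(prefix):])
    return [r for r in ("cput", "walltime") if r not in present]


# Sample copied from the job shown above.
sample = """\
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
"""
print(missing_time_limits(sample))  # ['cput', 'walltime']
```

An empty list would mean that both limits are set on the job.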
So I assume that my resources_max values are not being taken into consideration as defaults:
resources_max.cput = 01:30:00
resources_max.walltime = 03:00:00
and that "breaks" what the man page says:
resources_max
The maximum amount of each resource which can be requested by a single job in this queue. The queue value supersedes any server wide maximum limit.
Format: "resources_max.resource_name=value", see qmgr(1B); default value: infinite usage.
resources_default
The list of default resource values which are set as limits for a job residing in this queue and for which the job did not specify a limit. Format:
"resources_default.resource_name=value", see qmgr(1B); default value: none; if not set, the default limit for a job is determined by the first of the
following attributes which is set: server’s resources_default, queue’s resources_max, server’s resources_max. If none of these are set, the job will unlimited
resource usage.
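The lookup order that the man page describes can be written out as a short Python sketch. This only illustrates the documented order, not Torque's actual code: the function name `effective_limit` and its arguments are hypothetical, and the resources_max values quoted above are assumed here to be set on the queue.

```python
def effective_limit(resource, job, queue_default, server_default, queue_max, server_max):
    """Return (value, source) for the first place that sets `resource`."""
    # Order taken from the man page text: the job's own request first,
    # then the queue's resources_default, then the documented fallbacks.
    lookup_order = [
        ("job request", job),
        ("queue resources_default", queue_default),
        ("server resources_default", server_default),
        ("queue resources_max", queue_max),
        ("server resources_max", server_max),
    ]
    for source, settings in lookup_order:
        if resource in settings:
            return settings[resource], source
    return None, "unlimited"


# No request from the user and no defaults, only resources_max
# (assumed to be queue-level for this example).
queue_max = {"cput": "01:30:00", "walltime": "03:00:00"}
print(effective_limit("walltime", {}, {}, {}, queue_max, {}))
# ('03:00:00', 'queue resources_max')
```

Going by the documentation, the job should therefore end up with a walltime limit of 03:00:00, which is not what qstat shows.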
Because if the user does not request time limits, none are set on the
job at all, which makes no sense.
The job does not get Resource_List.neednodes = slc5_x64 either, although this is defined as a default resource:
# qmgr -c "p q glong_sl5"
set queue glong_sl5 resources_default.neednodes = slc5_x64
I changed my queues by adding:
Qmgr: s q long resources_default.cput = 48:00:00
Qmgr: s q long resources_default.walltime = 72:00:00
and now jobs have all those resources defined (including the default node).
----------------------
Now, after adding those limits to the queue, I qalter a job to add a
walltime, and "magically" the other default resources appear:
# qstat -f 10625854
Job Id: 10625854.pbs02.pic.es
Job_Name = STDIN
Job_Owner = lhpilot001 at ce07.pic.es
resources_used.cput = 04:35:16
resources_used.mem = 819224kb
resources_used.vmem = 2787164kb
resources_used.walltime = 04:54:34
job_state = R
queue = glong_sl5
server = pbs02.pic.es
Checkpoint = n
ctime = Fri Jun 4 07:33:45 2010
Error_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D2257
9/batch.err
exec_host = td163.pic.es/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Fri Jun 4 07:36:25 2010
Output_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D225
79/batch.out
Priority = 0
qtime = Fri Jun 4 07:33:45 2010
Rerunable = False
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
session_id = 1153
Shell_Path_List = /bin/sh
stagein = globus-cache-export.D22579.gpg at ce07.pic.es:/home/lhpilot001/.lcg
jm/globus-cache-export.D22579/globus-cache-export.D22579.gpg
substate = 42
Variable_List = PBS_O_HOME=/home/lhpilot001,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=lhpilot001,
PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
PBS_SERVER=pbs02.pic.es,PBS_O_WORKDIR=/home/lhpilot001,
X509_USER_PROXY=/home/lhpilot001/.globus/job/ce07.pic.es/16309.127562
9464/x509_up,
GLOBUS_REMOTE_IO_URL=/home/lhpilot001/.lcgjm/.remote_io_ptr/remote_io
_file-16309.1275629464,GLOBUS_LOCATION=/opt/globus,
GLOBUS_GRAM_JOB_CONTACT=https://ce07.pic.es:20100/16309/1275629464/,
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ce07.pic.es:20101/,
SCRATCH_DIRECTORY=/home/lhpilot001/,HOME=/home/lhpilot001,
LOGNAME=lhpilot001,
EDG_WL_JOBID=https://wms203.cern.ch:9000/vnxhV8Y4YESKwy98UgyERA,
GLOBUS_CE=ce07.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
PBS_O_QUEUE=glong_sl5,PBS_O_HOST=ce07.pic.es
euser = lhpilot001
egroup = lhpilot
hashname = 10625854.pbs02.pic.es
queue_rank = 483277
queue_type = E
etime = Fri Jun 4 07:33:45 2010
start_time = Fri Jun 4 07:36:25 2010
start_count = 1
fault_tolerant = False
# qalter -l walltime=87:00:00 10625854
# qstat -f 10625854
Job Id: 10625854.pbs02.pic.es
Job_Name = STDIN
Job_Owner = lhpilot001 at ce07.pic.es
resources_used.cput = 04:35:16
resources_used.mem = 819224kb
resources_used.vmem = 2787164kb
resources_used.walltime = 04:54:34
job_state = R
queue = glong_sl5
server = pbs02.pic.es
Checkpoint = n
ctime = Fri Jun 4 07:33:45 2010
Error_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D2257
9/batch.err
exec_host = td163.pic.es/3
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = n
mtime = Fri Jun 4 10:30:17 2010
Output_Path = ce07.pic.es:/home/lhpilot001/.lcgjm/globus-cache-export.D225
79/batch.out
Priority = 0
qtime = Fri Jun 4 07:33:45 2010
Rerunable = False
Resource_List.cput = 80:00:00
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 87:00:00
session_id = 1153
Shell_Path_List = /bin/sh
stagein = globus-cache-export.D22579.gpg at ce07.pic.es:/home/lhpilot001/.lcg
jm/globus-cache-export.D22579/globus-cache-export.D22579.gpg
substate = 42
Variable_List = PBS_O_HOME=/home/lhpilot001,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=lhpilot001,
PBS_O_PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/globus/bin:/opt/
glite/bin:/opt/edg/bin:/opt/lcg/bin:/usr/local/sbin:/usr/local/bin:/sb
in:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin,
PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
PBS_SERVER=pbs02.pic.es,PBS_O_WORKDIR=/home/lhpilot001,
X509_USER_PROXY=/home/lhpilot001/.globus/job/ce07.pic.es/16309.127562
9464/x509_up,
GLOBUS_REMOTE_IO_URL=/home/lhpilot001/.lcgjm/.remote_io_ptr/remote_io
_file-16309.1275629464,GLOBUS_LOCATION=/opt/globus,
GLOBUS_GRAM_JOB_CONTACT=https://ce07.pic.es:20100/16309/1275629464/,
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://ce07.pic.es:20101/,
SCRATCH_DIRECTORY=/home/lhpilot001/,HOME=/home/lhpilot001,
LOGNAME=lhpilot001,
EDG_WL_JOBID=https://wms203.cern.ch:9000/vnxhV8Y4YESKwy98UgyERA,
GLOBUS_CE=ce07.pic.es:2119/jobmanager-lcgpbs-glong_sl5,
PBS_O_QUEUE=glong_sl5,PBS_O_HOST=ce07.pic.es
euser = lhpilot001
egroup = lhpilot
hashname = 10625854.pbs02.pic.es
queue_rank = 483277
queue_type = E
etime = Fri Jun 4 07:33:45 2010
start_time = Fri Jun 4 07:36:25 2010
Walltime.Remaining = 30276
start_count = 1
fault_tolerant = False
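To make the difference between the two listings easier to see, the attributes can be compared programmatically. The following is a small Python sketch with a made-up helper name, `attribute_changes`; the dictionaries hold only a subset of the attributes, copied by hand from the two outputs above (Walltime.Remaining is left out).

```python
def attribute_changes(old, new):
    """Return (added, changed) attribute names between two qstat -f snapshots."""
    added = sorted(k for k in new if k not in old)
    changed = sorted(k for k in new if k in old and old[k] != new[k])
    return added, changed


# Subset of the first listing (before qalter).
before = {
    "mtime": "Fri Jun 4 07:36:25 2010",
    "Resource_List.neednodes": "1",
    "Resource_List.nodect": "1",
    "Resource_List.nodes": "1",
}
# Same subset from the second listing (after qalter -l walltime=87:00:00).
after = {
    "mtime": "Fri Jun 4 10:30:17 2010",
    "Resource_List.cput": "80:00:00",
    "Resource_List.neednodes": "1",
    "Resource_List.nodect": "1",
    "Resource_List.nodes": "1",
    "Resource_List.walltime": "87:00:00",
}

added, changed = attribute_changes(before, after)
print("added:", added)      # added: ['Resource_List.cput', 'Resource_List.walltime']
print("changed:", changed)  # changed: ['mtime']
```

Only walltime was passed to qalter, yet Resource_List.cput shows up as well.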
Another thing is the strange character at the end of
Walltime.Remaining. It's not a mail typo, it's Torque's output.
This seems like a big bug to me; maybe a developer could give an opinion.
TIA,
Arnau