[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
Arnau Bria
arnaubria at pic.es
Thu Jun 3 10:14:12 MDT 2010
Hi all,
I have run into a new problem with my fresh torque version:
torque-2.4.9-snap.201005191035.1cri
Jobs keep running even though their maximum wall and CPU times have
been reached. For example:
# qstat -f 10593824
Job Id: 10593824.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = ctaprd001 at ce05.pic.es
    resources_used.cput = 138:10:58
    resources_used.mem = 4557496kb
    resources_used.vmem = 6652012kb
    resources_used.walltime = 137:00:00
    job_state = R
    queue = glong_sl5
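The glong_sl5 queue has a CPU time limit of 80:00:00 and a walltime
limit of 87:00:00, both well below what the job has already used: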
# qstat -q glong_sl5
server: pbs02.pic.es
Queue            Memory CPU Time Walltime Node  Run Que Lm State
---------------- ------ -------- -------- ----  --- --- -- -----
glong_sl5            -- 80:00:00 87:00:00   -- 1205 256 --   E R
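
To make the overrun explicit, here is a minimal Python sketch that
compares the resources_used values from the qstat -f output above with
the queue limits. It is only an illustration: the helper names
(hms_to_seconds, parse_used, overruns) are made up for this example, the
limits are copied by hand from the qstat -q line rather than queried
from the server, and it assumes every time value is in HH:MM:SS form.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_used(qstat_f_output):
    """Collect 'resources_used.*' values from `qstat -f` text into a dict."""
    prefix = "resources_used."
    used = {}
    for line in qstat_f_output.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.startswith(prefix):
            used[key[len(prefix):]] = value.strip()
    return used


def overruns(used, limits):
    """Return {resource: seconds over the limit} for exceeded time limits."""
    over = {}
    for name, limit in limits.items():
        if name in used:
            excess = hms_to_seconds(used[name]) - hms_to_seconds(limit)
            if excess > 0:
                over[name] = excess
    return over


# Sample text taken from the `qstat -f 10593824` output in this message.
SAMPLE = """\
Job Id: 10593824.pbs02.pic.es
    Job_Name = STDIN
    resources_used.cput = 138:10:58
    resources_used.mem = 4557496kb
    resources_used.vmem = 6652012kb
    resources_used.walltime = 137:00:00
    job_state = R
    queue = glong_sl5
"""

# Assumption: limits copied by hand from the `qstat -q glong_sl5` line.
QUEUE_LIMITS = {"cput": "80:00:00", "walltime": "87:00:00"}

if __name__ == "__main__":
    for name, excess in overruns(parse_used(SAMPLE), QUEUE_LIMITS).items():
        print(f"{name}: over limit by {excess / 3600:.2f} h")
```

For this job the sketch reports roughly 58 hours over the CPU time
limit and 50 hours over the walltime limit.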
After killing all the jobs that had reached their walltime and
restarting pbs_server (because it hung for more than 5 minutes), neither
qstat -r nor plain qstat shows any usage time for the running jobs:
# qstat|head -n5
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
10593114.pbs02            STDIN            atpilot002             0 R glong_sl5
10593203.pbs02            STDIN            atpilot002             0 R glong_sl5
10593831.pbs02            STDIN            ctaprd001              0 R glong_sl5
# qstat -r|head -n10
pbs02.pic.es:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10593114.pbs02.p     atpilot0 glong_sl STDIN             11117     1   1     --    -- R    --
10593203.pbs02.p     atpilot0 glong_sl STDIN             27765     1   1     --    -- R    --
10593831.pbs02.p     ctaprd00 glong_sl STDIN             14654     1   1     --    -- R    --
10593832.pbs02.p     ctaprd00 glong_sl STDIN              3259     1   1     --    -- R    --
10593833.pbs02.p     ctaprd00 glong_sl STDIN             11133     1   1     --    -- R    --
So, could anyone give me a hand with this issue?
Is it possible to downgrade the installation without losing jobs
(e.g. by following the upgrade procedure)?
TIA,
Arnau