[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Arnau Bria arnaubria at pic.es
Thu Jun 3 10:14:12 MDT 2010


Hi all,


I have run into a new problem with my freshly installed Torque version:

torque-2.4.9-snap.201005191035.1cri

Jobs keep running even though their maximum walltime and CPU time have been reached.

For example:

# qstat -f 10593824
Job Id: 10593824.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = ctaprd001 at ce05.pic.es
    resources_used.cput = 138:10:58
    resources_used.mem = 4557496kb
    resources_used.vmem = 6652012kb
    resources_used.walltime = 137:00:00
    job_state = R
    queue = glong_sl5
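
As far as I know it is pbs_mom that has to kill a job once resources_used goes over the limit, so unless $ignwalltime or $igncput are set in the MOM config I would expect the job to be killed. A quick check on the execution node (assuming the default /var/spool/torque layout; the exact log wording may vary between versions):

# grep -E 'ignwalltime|igncput' /var/spool/torque/mom_priv/config
# grep -i 'exceeded limit' /var/spool/torque/mom_logs/20100603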

The glong_sl5 queue it is running in looks like this:

# qstat -q glong_sl5

server: pbs02.pic.es

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
glong_sl5          --   80:00:00 87:00:00   --  1205 256 --   E R
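
Just to rule out something odd in the queue definition itself, the limits can be dumped with qmgr:

# qmgr -c 'print queue glong_sl5'
# qmgr -c 'list queue glong_sl5 resources_max'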


After killing all the jobs that had exceeded their walltime, and restarting
pbs_server (because it hung for more than 5 minutes), qstat -r (and plain
qstat) no longer show the time used by any running job:


# qstat|head -n5
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
10593114.pbs02            STDIN            atpilot002             0 R glong_sl5      
10593203.pbs02            STDIN            atpilot002             0 R glong_sl5      
10593831.pbs02            STDIN            ctaprd001              0 R glong_sl5      

# qstat -r|head -n10

pbs02.pic.es: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10593114.pbs02.p     atpilot0 glong_sl STDIN             11117     1   1    --    --  R   -- 
10593203.pbs02.p     atpilot0 glong_sl STDIN             27765     1   1    --    --  R   -- 
10593831.pbs02.p     ctaprd00 glong_sl STDIN             14654     1   1    --    --  R   -- 
10593832.pbs02.p     ctaprd00 glong_sl STDIN              3259     1   1    --    --  R   -- 
10593833.pbs02.p     ctaprd00 glong_sl STDIN             11133     1   1    --    --  R   -- 
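
If I understand it correctly, resources_used comes from the periodic MOM status updates, so after a restart the usage should reappear once every MOM has reported in. One way to check that the server and the MOMs are actually talking to each other (td455 is just an example WN name):

# pbsnodes -a | grep -c 'state = down'
# momctl -d 2 -h td455.pic.es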


So, could anyone give me a hand with this issue?

Is it possible to downgrade the installation without losing jobs
(i.e. following the same procedure as for an upgrade)?
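
What I have in mind is something along these lines (just a sketch, assuming the serverdb format written by 2.4.9 is still readable by the version I would go back to):

# qterm -t quick    # shut down pbs_server; running jobs keep running
(install the previous torque packages on the server and the WNs)
# pbs_server        # start the old server against the existing spool

Does that sound safe, or am I missing a step?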

TIA,
Arnau

