[torquedev] Jobs remain in queue after process completion in Torque 2.2

Steve Snelgrove ssnelgrove at clusterresources.com
Wed Nov 7 14:10:02 MST 2007


Here is a test case that reproduces the problem.
In this run, job 57 appears to be stuck.



cmd:~$ for i in `seq 1 1000`; do echo sleep 1|qsub -lwalltime=1;done


cmd:~$ qstat -r

makua.cridomain:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
57.makua.cridomain   ssnelgro batch    STDIN        9250     1  --    --  00:00 R   --
122.makua.cridomain  ssnelgro batch    STDIN        9722     1  --    --  00:00 R   --
147.makua.cridomain  ssnelgro batch    STDIN        9910     1  --    --  00:00 R   --
151.makua.cridomain  ssnelgro batch    STDIN        9939     1  --    --  00:00 R   --
251.makua.cridomain  ssnelgro batch    STDIN       10721     1  --    --  00:00 R   --
252.makua.cridomain  ssnelgro batch    STDIN       10725     1  --    --  00:00 R   --
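
As a rough sketch, the jobs still in state R can be pulled out of this listing with a small filter. The function name running_jobs is made up for illustration, and the filter assumes each job occupies a single line with the state in the tenth whitespace-separated column, as in the qstat -r layout shown here; a job name containing spaces would shift the columns.

```shell
# Hypothetical helper: read `qstat -r` output on stdin and print the IDs
# of jobs whose state column (10th field) is R.
running_jobs() {
    awk '$1 ~ /^[0-9]+\./ && $10 == "R" { print $1 }'
}

# Example with one row copied from the listing:
printf '%s\n' \
    "57.makua.cridomain   ssnelgro batch    STDIN        9250     1  --    --  00:00 R   --" \
    | running_jobs
```

On a live system this would be used as `qstat -r | running_jobs`, run some time after the one-second jobs should have finished.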


cmd:/var/spool/torque$ tracejob 57

Job: 57.makua.cridomain

11/07/2007 11:16:57  S    enqueuing into batch, state 1 hop 1
11/07/2007 11:16:57  S    Job Queued at request of ssnelgrove at makua.cridomain, owner =
                          ssnelgrove at makua.cridomain, job name = STDIN, queue = batch
11/07/2007 11:16:57  S    Job Modified at request of Scheduler at makua.cridomain
11/07/2007 11:16:57  L    Not enough of the right type of nodes available
11/07/2007 11:16:57  A    queue=batch
11/07/2007 11:17:10  S    Job Modified at request of Scheduler at makua.cridomain
11/07/2007 11:17:10  L    Job Run
11/07/2007 11:17:10  S    Job Run at request of Scheduler at makua.cridomain
11/07/2007 11:17:10  A    user=ssnelgrove group=ssnelgrove jobname=STDIN queue=batch ctime=1194459417
                          qtime=1194459417 etime=1194459417 start=1194459430 owner=ssnelgrove at makua.cridomain
                          exec_host=makua/4 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:00:01
11/07/2007 11:17:11  M    scan_for_terminated: job 57.makua.cridomain task 1 terminated, sid=9250
11/07/2007 11:17:11  M    job was terminated
11/07/2007 11:32:22  M    EOF? received attempting to process obit reply
11/07/2007 11:32:22  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3112kb
                          resources_used.vmem=11496kb resources_used.walltime=00:15:12
11/07/2007 11:32:22  M    obit sent to server
11/07/2007 11:32:22  A    user=ssnelgrove group=ssnelgrove jobname=STDIN queue=batch ctime=1194459417
                          qtime=1194459417 etime=1194459417 start=1194459430 owner=ssnelgrove at makua.cridomain
                          exec_host=makua/4 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:00:01 session=9250 end=1194460342
                          Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3112kb
                          resources_used.vmem=11496kb resources_used.walltime=00:15:12
11/07/2007 11:37:22  S    dequeuing from batch, state COMPLETE
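
The size of the delay can be read off the accounting record: it gives start=1194459430 and end=1194460342 for a job that requested a walltime of one second. A quick sketch of the arithmetic follows; the variable names are only for illustration, and the values are copied from the trace.

```shell
# Values copied from the accounting ('A') record in the trace.
start=1194459430   # start= field (epoch seconds)
end=1194460342     # end= field (epoch seconds)
requested=1        # Resource_List.walltime=00:00:01, in seconds

# Time the server considered the job to be running.
elapsed=$((end - start))
hms=$(printf '%02d:%02d:%02d' \
    $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60)))

echo "elapsed walltime: $hms ($elapsed s), requested: $requested s"
```

The result, 912 seconds, agrees with the resources_used.walltime=00:15:12 reported in the trace, and it lines up with the gap between the "job was terminated" entry at 11:17:11 and the "obit sent to server" entry at 11:32:22.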


