[torqueusers] zombie jobs

Glen Beane beaneg at umcs.maine.edu
Wed Oct 13 08:58:55 MDT 2004


You can shut down pbs_server, delete the files in server_priv/jobs 
that correspond to the hung job (rm *.job_num), then restart 
pbs_server.

This shouldn't affect any other running jobs, and it won't trash your 
server database.
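For concreteness, the procedure might look like the following sketch. PBS_HOME, the job id, and the exact job-file names are placeholders here (the server spool directory and file naming vary by installation), so adjust them for your site:

```shell
# Sketch of the cleanup steps above -- placeholder paths and ids.
PBS_HOME=/var/spool/pbs   # adjust: your server's spool directory
JOBID=1234                # numeric part of the zombie job's id

qterm -t quick            # stop pbs_server; running jobs continue

# remove the stuck job's files from the server's job directory
# (file naming may differ slightly between TORQUE versions)
rm -f $PBS_HOME/server_priv/jobs/$JOBID.*

pbs_server                # restart; the zombie job should be gone
```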

On Oct 13, 2004, at 8:59 AM, <Stewart.Samuels at aventis.com> wrote:

> We have a 90-compute-node, dual-master-node Beowulf cluster running 
> torque-1.0.1p6 and maui-3.2.6p6.  The problem we are seeing is that 
> when a compute node fails, jobs get hosed in the queue and cannot be 
> deleted either by the user or by the root administrator.  Has anyone 
> else seen this problem?  If so, how did you clear the job(s)?  The 
> only way I have been able to clear them is to rebuild a new server 
> database with "pbs_server -t create" and then restart the whole 
> system with a "/sbin/services pbs_server reboot" scenario.
>
> We are running Redhat's EL Advanced Server 3.0 Update 2 on the cluster.
>
> Any help in removing the jobs without rebuilding the database would be 
> greatly appreciated.
>
> Thanks.
>
>                Stewart Samuels
>                 Technical Advisor
>                Global Unix Engineering Services                
>
>             1041 Route 202-206                  
>               Bridgewater, NJ  08807
>
>               (908) 231-4762
>                Stewart.Samuels at Aventis.com
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers



