[torqueusers] zombie jobs

Ake Ake.Sandgren at hpc2n.umu.se
Wed Oct 13 23:50:31 MDT 2004


On Wed, Oct 13, 2004 at 08:59:47AM -0400, Stewart.Samuels at aventis.com wrote:
> We have a 90 compute node, dual master node Beowulf cluster running
> torque-1.0.1p6 and maui-3.2.6p6.  The problem we are seeing is that when
> a compute node fails, jobs get stuck in the queue and cannot be deleted
> by either the user or the root administrator.  Has anyone else seen this
> problem?  If so, how did you clear the job(s)?  The only way I have been
> able to clear them is to rebuild a new server database with
> "pbs_server -t create" and then restart the whole system normally with
> "/sbin/services pbs_server reboot".

qterm -t quick
rm /var/spool/PBS/server_priv/jobs/<jobid>.*
rsh <each node where the job started> rm /var/spool/PBS/mom_priv/jobs/<jobid>.*
restart the relevant pbs_moms
start pbs_server
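
If you have to do this regularly it can be scripted.  A rough, untested
sketch; the spool path, job id, node list and pbs_mom init script
location are assumptions to adjust for your installation:

#!/bin/sh
# Untested sketch of the manual cleanup above.  Spool path, job id,
# node list and the pbs_mom init script are site-specific assumptions.

JOBID=$1                          # e.g. 1234.yourserver
PBS_SPOOL=/var/spool/PBS
NODES="node01 node02"             # the nodes the zombie job ran on

# Stop the server quickly, without waiting for running jobs.
qterm -t quick

# Remove the stale job files on the server host.
rm -f $PBS_SPOOL/server_priv/jobs/$JOBID.*

# Remove the stale job files on each mom and restart it.
for n in $NODES; do
    rsh $n "rm -f $PBS_SPOOL/mom_priv/jobs/$JOBID.*"
    rsh $n "/etc/init.d/pbs_mom restart"   # or however you restart moms
done

# Bring the server back up.
pbs_server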

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se	Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

