[torqueusers] zombie jobs

jacksond at supercluster.org
Thu Oct 14 16:59:48 MDT 2004


   Torque 1.1.0p3 contains a new command called 'momctl'.  This command 
allows remote shutdown, reconfiguration, diagnostics, and querying of the 
pbs_mom daemon.  We are currently adding support for a '-c' flag which can 
be used to clear stale job information.  If you would like to beta-test 
this new command, please let us know.


On Thu, 14 Oct 2004, Ake wrote:

> On Wed, Oct 13, 2004 at 08:59:47AM -0400, Stewart.Samuels at aventis.com wrote:
>> We have a 90-compute-node, dual-master-node Beowulf cluster running
>> torque-1.0.1p6 and maui-3.2.6p6.  The problem we are seeing is that
>> when a compute node fails, jobs get stuck in the queue and cannot be
>> deleted by either the user or the root administrator.  Has anyone else
>> seen this problem?  If so, how did you clear the job(s)?  The only way
>> I have been able to clear the job(s) is to rebuild the server database
>> with "pbs_server -t create" and then restart the whole system normally
>> with "/sbin/services pbs_server reboot".
> qterm -t quick
> rm /var/spool/PBS/server_priv/jobs/jobid.*
> rsh all_nodes_where_job_started rm /var/spool/PBS/mom_priv/jobs/jobid.*
> restart relevant moms
> start pbs_server
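
The manual steps quoted above could be wrapped in a small script.  What
follows is only a minimal sketch, not a tested tool.  It assumes the spool
directory from the quoted message (/var/spool/PBS), rsh access to the
nodes, and that the admin supplies the job id and node names.  The names
clear_stale_job, run, PBS_HOME and DRY_RUN are made up for illustration.
By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the quoted cleanup steps. Hypothetical helper names; the
# spool path is taken from the quoted message and may differ per site.

PBS_HOME=${PBS_HOME:-/var/spool/PBS}
DRY_RUN=${DRY_RUN:-1}    # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# usage: clear_stale_job JOBID NODE [NODE ...]
clear_stale_job() {
    jobid=$1
    shift

    # 1. Stop the server without touching running jobs.
    run qterm -t quick

    # 2. Remove the stale job files on the server (sh -c expands the glob).
    run sh -c "rm $PBS_HOME/server_priv/jobs/$jobid.*"

    # 3. Remove the stale job files on every node where the job started.
    for node in "$@"; do
        run rsh "$node" "rm $PBS_HOME/mom_priv/jobs/$jobid.*"
    done

    # 4. Restarting pbs_mom is site-specific, so this is only a reminder.
    echo "now restart pbs_mom on: $*"

    # 5. Start the server again.
    run pbs_server
}
```

Called as "clear_stale_job 123 node01 node02" (a made-up job id and node
names), the dry run lists the commands in order; setting DRY_RUN=0 would
execute them, so review the printed output first.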
