[torqueusers] mom not killing processes

Garrick Staples garrick at clusterresources.com
Thu Mar 1 16:37:30 MST 2007


On Wed, Feb 28, 2007 at 04:40:56PM -0500, Geoffrey W. Cowles alleged:
> Hello,
>    I have set up maui and torque to handle job scheduling on our 128  
> node cluster.  I am able to submit and run a job.  When I delete the  
> job from the queue using qdel, it will exit the queue, but all the  
> executables will continue running indefinitely on the nodes.  I  
> examined the server logs, no errors, the moms are saying everything  
> was killed correctly and the server can delete the job.
> 
> The mom_logs give some information when I increase the loglevel to  
> 6.   I have included portions of the mom_log from the node which  
> would have been the first node in the hostlist, the 'master' node of  
> the parallel run.     The actual job pids  on the node (two cpus, two  
> jobs) are 19265 and 19264.  I will paraphrase the mom_log because the  
> lines related to the submission, checking, and deleting of the job  
> total more than 1000.

pbs_mom treats every process in the job's original session (same session
id) as part of the job and kills it when the job ends.  But processes that
change their session id (usually by daemonizing), or that are spawned
through ssh/rsh, are out of pbs_mom's control.
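
To make the session-id point concrete, here is a small sketch (not part of
TORQUE; the function name is made up for illustration) that starts two
children on a POSIX system, one that inherits the parent's session and one
that is placed in a new session, the way a daemonizing program would do
with setsid():

```python
import os
import subprocess
import sys

# The child just prints its own session id.
CHILD = "import os; print(os.getsid(0))"


def session_ids():
    """Return (parent_sid, inherited_child_sid, detached_child_sid)."""
    parent = os.getsid(0)
    inherited = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # start_new_session=True makes the child call setsid() before exec.
    detached = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
        start_new_session=True,
    ).stdout.strip()
    return parent, int(inherited), int(detached)


if __name__ == "__main__":
    print(session_ids())
```

The first child reports the parent's session id; the second reports a
different one, which is the situation in which a session-based kill no
longer reaches it.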

For that, you need to kill those processes in your epilogue script.
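
As a rough sketch of such an epilogue (an illustration, not a script that
ships with TORQUE): the helper names below are hypothetical, the `ps`
invocation assumes a procps-style `ps` that accepts `-o uid=`, and it
assumes pbs_mom passes the job owner's user name as the second argument.
It also assumes a node only ever runs one job per user at a time, because
it signals every process the user owns on the node.

```python
#!/usr/bin/env python3
"""Hypothetical epilogue sketch: signal a job owner's leftover processes."""
import os
import pwd
import signal
import subprocess
import sys


def stray_pids(ps_output, uid, protect=()):
    """Pick PIDs owned by `uid` out of `ps -e -o pid= -o uid=` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue  # skip anything that is not a "PID UID" pair
        pid, owner = int(fields[0]), int(fields[1])
        if owner == uid and pid not in protect:
            pids.append(pid)
    return pids


def kill_strays(user, sig=signal.SIGTERM):
    """Send `sig` to every process owned by `user`; return the PIDs signalled."""
    uid = pwd.getpwnam(user).pw_uid
    if uid == 0:
        return []  # never sweep root's processes
    listing = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "uid="],
        capture_output=True, text=True, check=True,
    ).stdout
    signalled = []
    for pid in stray_pids(listing, uid, protect={os.getpid()}):
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # process exited between listing and signalling
    return signalled


if __name__ == "__main__" and len(sys.argv) > 2:
    # Assumed argument layout: argv[1] is the job id, argv[2] the job owner.
    kill_strays(sys.argv[2])
```

If users can have more than one job on a node, a blanket per-user sweep
like this would also take out their other jobs, so it would need a
narrower selection rule.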


> 
> *** I get some lines about successfully launching the job.  Job is  
> definitely running.
> 
> ***** then I get a line requesting a status update followed by  
> multiple lines like this.  The correct PIDs are in these lines
> 02/28/2007 16:13:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid  
> 19120 sid 19035

This is MOM reporting all the sessions it sees on the node; it does not
necessarily have any way of matching them up with a job.

 
> **** status update looks okay
> pbs_mom;n/a;is_update_stat;status update successfully sent to  
> hydra.local
> 
> 
> *** mom receives kill signal from pbs_server
> pbs_mom;Job;137.hydra.local;signalling job with signal SIGTERM
> 
> *** mom kills some processes, a bunch of lines like this.  NONE of  
> them the correct PID, qdel shows job is gone.
> pbs_mom;Job;137.hydra.local;kill_task: killing pid 19035 task 1 with  
> sig 15
> pbs_mom;Job;137.hydra.local;kill_task: killing pid 19127 task 1 with  
> sig 15
> ...
> ...
> 02/28/2007 16:13:42;0008;   pbs_mom;Job;137.hydra.local;kill_job done

These are likely the top-level processes: the user's login shell, the
job script, and their direct child processes.


> ****  my executables are clearly still running on the node but the  
> mom sends some messages that imply things are okay
> 02/28/2007 16:13:49;0008;   pbs_mom;Job;process_request;request type  
> DeleteJob from host hydra.local received
> 02/28/2007 16:13:49;0008;   pbs_mom;Job;process_request;request type  
> DeleteJob from host hydra.local allowed
> 02/28/2007 16:13:49;0008;   pbs_mom;Job;dispatch_request;dispatching  
> request DeleteJob on sd=10
> 02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;deleting job  
> 137.hydra.local in state EXITED
> 02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removing job
> 02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removed job  
> script
> 02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removed job file

Yes, it cleaned up everything it could.


> *** then a strange requested status update every few minutes -->  
> includes the job PIDS?
> 
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;is_update_stat;composing  
> status update for server
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[0]: pid  
> 19130 sid 19130
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid  
> 19131 sid 19131
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid  
> 19264 sid 19131
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid  
> 19265 sid 19130
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[0]: pid  
> 19130 sid 19130
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid  
> 19131 sid 19131
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid  
> 19264 sid 19131
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid  
> 19265 sid 19130
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[0]: pid 19130  
> uid 1233
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19131  
> uid 1233
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19264  
> uid 1233
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19265  
> uid 1233
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;totmem;totmem: total  
> mem=3150086144
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;availmem;availmem: free  
> mem=2663895040
> 02/28/2007 16:22:06;0002;   pbs_mom;n/a;is_update_stat;status update  
> successfully sent to hydra.local

Again, pbs_mom is reporting everything it can see, but doesn't
necessarily have enough info to match it up with a specific job.
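
One way to read those "sessions" lines yourself is to group the reported
pids by session id. The sketch below is only an illustration: the regular
expression is inferred from the log lines quoted above (unwrapped to one
line each), and the function name is made up.

```python
import re
from collections import defaultdict

# Matches e.g. "sessions[2]: pid 19264 sid 19131"
SESSION_RE = re.compile(r"sessions\[\d+\]: pid\s+(\d+)\s+sid\s+(\d+)")

SAMPLE = """\
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130
"""


def pids_by_session(log_text):
    """Map each session id to the sorted list of pids reported in it."""
    groups = defaultdict(set)
    for pid, sid in SESSION_RE.findall(log_text):
        groups[int(sid)].add(int(pid))
    return {sid: sorted(pids) for sid, pids in groups.items()}


if __name__ == "__main__":
    print(pids_by_session(SAMPLE))
```

On the sample this gives `{19130: [19130, 19265], 19131: [19131, 19264]}`.
The leftover executables (19264 and 19265) sit in sessions 19131 and
19130, neither of which is session 19035 that shows up in the earlier
kill_task lines.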


