[torqueusers] mom not killing processes

Geoffrey W. Cowles gcowles at umassd.edu
Wed Feb 28 14:40:56 MST 2007


Hello,
    I have set up Maui and TORQUE to handle job scheduling on our
128-node cluster.  I am able to submit and run a job.  When I delete
the job with qdel, it leaves the queue, but all of its executables
continue running indefinitely on the nodes.  I examined the server
logs and found no errors: the moms report that everything was killed
correctly and that the server can delete the job.

The mom_logs give some information when I increase the loglevel to
6.  I have included portions of the mom_log from the node that would
have been first in the hostlist, the 'master' node of the parallel
run.  The actual job PIDs on that node (two CPUs, two jobs) are 19265
and 19264.  I will paraphrase the mom_log because the lines related to
the submission, checking, and deletion of the job total more than
1000.



*** I get some lines about successfully launching the job.  Job is  
definitely running.

***** then I get a line requesting a status update, followed by
multiple lines like this.  The correct PIDs are in these lines.
02/28/2007 16:13:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 19120 sid 19035

**** status update looks okay
pbs_mom;n/a;is_update_stat;status update successfully sent to hydra.local


*** mom receives kill signal from pbs_server
pbs_mom;Job;137.hydra.local;signalling job with signal SIGTERM

*** mom kills some processes, logging a bunch of lines like this.
NONE of them is one of the correct PIDs, yet after qdel the job is
gone from the queue.
pbs_mom;Job;137.hydra.local;kill_task: killing pid 19035 task 1 with sig 15
pbs_mom;Job;137.hydra.local;kill_task: killing pid 19127 task 1 with sig 15
...
...
02/28/2007 16:13:42;0008;   pbs_mom;Job;137.hydra.local;kill_job done
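
One check that can be run on the node itself is to look up the session
ID of each stray process and compare it with the sid values the mom
logs.  Below is a minimal sketch in Python.  It assumes the mom picks
the processes to signal by session ID, which the sid fields in the log
suggest but which I have not confirmed in the TORQUE source.
session_report is a hypothetical helper name, not part of TORQUE.

```python
import os
import sys


def session_report(pids):
    """Map each PID to its session ID, or None if the process is gone
    or cannot be inspected (hypothetical helper, not a TORQUE tool)."""
    report = {}
    for pid in pids:
        try:
            # os.getsid(pid) returns the session ID of that process.
            report[pid] = os.getsid(pid)
        except (ProcessLookupError, PermissionError):
            report[pid] = None
    return report


if __name__ == "__main__":
    # Usage on the compute node: python session_report.py 19264 19265
    for pid, sid in session_report(int(arg) for arg in sys.argv[1:]).items():
        print(f"pid {pid} sid {sid}")
```

If the session IDs printed for the stray processes differ from the
session of the PIDs named in the kill_task lines, that would be
consistent with the mom never signalling them.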


****  my executables are clearly still running on the node but the  
mom sends some messages that imply things are okay
02/28/2007 16:13:49;0008;   pbs_mom;Job;process_request;request type DeleteJob from host hydra.local received
02/28/2007 16:13:49;0008;   pbs_mom;Job;process_request;request type DeleteJob from host hydra.local allowed
02/28/2007 16:13:49;0008;   pbs_mom;Job;dispatch_request;dispatching request DeleteJob on sd=10
02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;deleting job 137.hydra.local in state EXITED
02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removing job
02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removed job script
02/28/2007 16:13:49;0080;   pbs_mom;Job;137.hydra.local;removed job file



*** then a strange status update is requested every few minutes, and
it still includes the job PIDs (19264 and 19265)

02/28/2007 16:22:06;0002;   pbs_mom;n/a;is_update_stat;composing status update for server
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
02/28/2007 16:22:06;0002;   pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130
02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[0]: pid 19130 uid 1233
02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19131 uid 1233
02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19264 uid 1233
02/28/2007 16:22:06;0002;   pbs_mom;n/a;nusers;nusers[1]: pid 19265 uid 1233
02/28/2007 16:22:06;0002;   pbs_mom;n/a;totmem;totmem: total mem=3150086144
02/28/2007 16:22:06;0002;   pbs_mom;n/a;availmem;availmem: free mem=2663895040
02/28/2007 16:22:06;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to hydra.local
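
Since the full log runs to more than 1000 lines, the comparison I did
by eye (PIDs signalled by kill_task versus PIDs still listed in later
status updates) can be sketched as a small Python script.  This is
only an illustration: the regular expressions are guesses based on the
lines quoted in this message, and survivors is a hypothetical function
name, not anything shipped with TORQUE.

```python
import re

# Patterns inferred from the quoted mom_log lines (an assumption, not
# a documented log format).
KILL_RE = re.compile(r"kill_task: killing pid (\d+)")
SESSION_RE = re.compile(r"sessions\[\d+\]: pid (\d+) sid (\d+)")


def survivors(log_text):
    """Return {pid: sid} for PIDs that appear in status updates after
    the last 'kill_job done' line but were never named by kill_task."""
    # Collapse whitespace so log lines wrapped by a mail client rejoin.
    text = " ".join(log_text.split())
    before, marker, after = text.rpartition("kill_job done")
    if not marker:
        # No kill was logged, so there is nothing to compare against.
        return {}
    killed = {int(pid) for pid in KILL_RE.findall(before)}
    return {
        int(pid): int(sid)
        for pid, sid in SESSION_RE.findall(after)
        if int(pid) not in killed
    }
```

Calling survivors(open(path).read()) on a mom_log file (path being
wherever the mom writes its log on the node) lists every session
entry that outlived the kill, including any unrelated login sessions,
so the result still has to be matched against the known job PIDs.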

Please Help!!!

-G



