[torqueusers] mom not killing processes
Geoffrey W. Cowles
gcowles at umassd.edu
Wed Feb 28 14:40:56 MST 2007
Hello,
I have set up Maui and TORQUE to handle job scheduling on our 128-node
cluster. I am able to submit and run a job. When I delete the job from
the queue using qdel, it leaves the queue, but all of its executables
continue running indefinitely on the nodes. I examined the server logs
and found no errors; the moms report that everything was killed
correctly and that the server can delete the job.
The mom_logs give some information when I increase the loglevel to
6. I have included portions of the mom_log from the node that would
have been the first node in the hostlist, the 'master' node of the
parallel run. The actual job PIDs on that node (two cpus, two jobs)
are 19265 and 19264. I will paraphrase the mom_log because the lines
related to the submission, checking, and deletion of the job total
more than 1000.
*** I get some lines about successfully launching the job. Job is
definitely running.
*** then I get a line requesting a status update, followed by
multiple lines like this one. The correct PIDs appear in these lines
02/28/2007 16:13:06;0002; pbs_mom;n/a;sessions;sessions[1]: pid 19120 sid 19035
**** status update looks okay
pbs_mom;n/a;is_update_stat;status update successfully sent to hydra.local
*** mom receives kill signal from pbs_server
pbs_mom;Job;137.hydra.local;signalling job with signal SIGTERM
*** mom kills some processes; a bunch of lines like these follow. NONE
of them names the correct PID, and qdel shows the job is gone.
pbs_mom;Job;137.hydra.local;kill_task: killing pid 19035 task 1 with sig 15
pbs_mom;Job;137.hydra.local;kill_task: killing pid 19127 task 1 with sig 15
...
...
02/28/2007 16:13:42;0008; pbs_mom;Job;137.hydra.local;kill_job done
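
To check the claim that none of the kill_task lines names a job PID, the comparison can be scripted. This is a minimal sketch, assuming the log text has the form quoted above; the function names (killed_pids, missed_pids) are hypothetical helpers, not part of TORQUE, and the regular expression tolerates the line wrapping introduced by email.

```python
import re

# Pattern for the kill_task lines quoted above; \s+ tolerates wrapped lines.
KILL_RE = re.compile(
    r"kill_task:\s+killing\s+pid\s+(\d+)\s+task\s+(\d+)\s+with\s+sig\s+(\d+)"
)

def killed_pids(log_text):
    """Return the set of PIDs that pbs_mom reports signalling."""
    return {int(m.group(1)) for m in KILL_RE.finditer(log_text)}

def missed_pids(log_text, expected_pids):
    """Return the expected PIDs that never appear in a kill_task line."""
    return sorted(set(expected_pids) - killed_pids(log_text))

# Sample text copied from the excerpt above (hypothetical variable name).
sample = """pbs_mom;Job;137.hydra.local;kill_task: killing pid 19035 task 1 with
sig 15
pbs_mom;Job;137.hydra.local;kill_task: killing pid 19127 task 1 with
sig 15"""

# The two job PIDs reported for this node.
print(missed_pids(sample, [19264, 19265]))
```

Run against the full mom_log rather than this two-line sample, the same call would show whether the job PIDs were ever signalled.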
**** my executables are clearly still running on the node but the
mom sends some messages that imply things are okay
02/28/2007 16:13:49;0008; pbs_mom;Job;process_request;request type DeleteJob from host hydra.local received
02/28/2007 16:13:49;0008; pbs_mom;Job;process_request;request type DeleteJob from host hydra.local allowed
02/28/2007 16:13:49;0008; pbs_mom;Job;dispatch_request;dispatching request DeleteJob on sd=10
02/28/2007 16:13:49;0080; pbs_mom;Job;137.hydra.local;deleting job 137.hydra.local in state EXITED
02/28/2007 16:13:49;0080; pbs_mom;Job;137.hydra.local;removing job
02/28/2007 16:13:49;0080; pbs_mom;Job;137.hydra.local;removed job script
02/28/2007 16:13:49;0080; pbs_mom;Job;137.hydra.local;removed job file
*** then a strange status update is requested every few minutes -->
it includes the job PIDs?
02/28/2007 16:22:06;0002; pbs_mom;n/a;is_update_stat;composing status update for server
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
02/28/2007 16:22:06;0002; pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130
02/28/2007 16:22:06;0002; pbs_mom;n/a;nusers;nusers[0]: pid 19130 uid 1233
02/28/2007 16:22:06;0002; pbs_mom;n/a;nusers;nusers[1]: pid 19131 uid 1233
02/28/2007 16:22:06;0002; pbs_mom;n/a;nusers;nusers[1]: pid 19264 uid 1233
02/28/2007 16:22:06;0002; pbs_mom;n/a;nusers;nusers[1]: pid 19265 uid 1233
02/28/2007 16:22:06;0002; pbs_mom;n/a;totmem;totmem: total mem=3150086144
02/28/2007 16:22:06;0002; pbs_mom;n/a;availmem;availmem: free mem=2663895040
02/28/2007 16:22:06;0002; pbs_mom;n/a;is_update_stat;status update successfully sent to hydra.local
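
The sessions lines in this status update can be grouped by session ID to see which sessions the surviving processes belong to. This is a minimal sketch, assuming the "sessions[N]: pid P sid S" format shown above; pids_by_session is a hypothetical helper, not a TORQUE function. In the excerpt, the job PIDs 19264 and 19265 sit in sessions 19131 and 19130, while the kill_task lines quoted earlier target pids 19035 and 19127.

```python
import re
from collections import defaultdict

# Pattern for the "sessions[N]: pid P sid S" lines; \s+ tolerates wrapped lines.
SESSION_RE = re.compile(r"sessions\[\d+\]:\s+pid\s+(\d+)\s+sid\s+(\d+)")

def pids_by_session(log_text):
    """Map each session ID to the sorted list of PIDs logged under it."""
    groups = defaultdict(set)
    for pid, sid in SESSION_RE.findall(log_text):
        groups[int(sid)].add(int(pid))
    return {sid: sorted(pids) for sid, pids in groups.items()}

# Sample text copied from the status update above (hypothetical variable name).
status = """pbs_mom;n/a;sessions;sessions[0]: pid 19130 sid 19130
pbs_mom;n/a;sessions;sessions[1]: pid 19131 sid 19131
pbs_mom;n/a;sessions;sessions[2]: pid 19264 sid 19131
pbs_mom;n/a;sessions;sessions[2]: pid 19265 sid 19130"""

for sid, pids in pids_by_session(status).items():
    print(sid, pids)
```

Repeated lines, such as the duplicated sessions entries in the log, collapse into a single entry per PID because each session's PIDs are held in a set.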
Please Help!!!
-G