[torqueusers] job can't be killed by mom

Tina Declerck tinad at nersc.gov
Mon Oct 20 16:01:32 MDT 2008


Recently, several jobs could not be killed.  In one case I see the
following in one of the mom_logs:

10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m;im_request: SIGNAL_TASK 566567.jacin03-m.nersc.gov from node 0 task 3548 signal 9
10/20/2008 12:27:00;0002;   pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m.nersc.gov;im_request: SIGNAL_TASK 566567.jacin03-m from node 0 task 3549 signal 9
10/20/2008 12:27:00;0002;   pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'KILL_JOB' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008;   pbs_mom;Job;kill_job;im_request: sending signal 9, "KILL" to job 566567.jacin03-m.nersc.gov, reason: kill_job message received
10/20/2008 12:27:00;0080;   pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
10/20/2008 12:27:00;0008;   pbs_mom;Job;566567.jacin03-m.nersc.gov;one or more running tasks found - no obit sent
10/20/2008 12:27:24;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server jacin03-m
10/20/2008 12:27:24;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server jacin03-m
10/20/2008 12:27:24;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:24;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:24;0008;   pbs_mom;Job;do_rpp;got an inter-server request
10/20/2008 12:27:24;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
10/20/2008 12:27:24;0008;   pbs_mom;Job;do_rpp;got an inter-server request
10/20/2008 12:27:24;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
10/20/2008 12:27:25;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:25;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:45;0002;   pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
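
To pull only the records for one job out of a busy mom log, something
like the following might help.  It is a minimal sketch, not a TORQUE
tool: it assumes one record per line in the mom_logs file and six
semicolon-separated fields, as in the excerpt above.  The field names
(timestamp, event, daemon, obj_class, obj_name, message) and the helper
names parse_mom_log_line and records_for_job are my own labels, not
anything defined by TORQUE.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MomLogRecord(NamedTuple):
    """One pbs_mom log record; field names are illustrative labels."""
    timestamp: str
    event: str       # event code kept as the raw string, e.g. "0008"
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_mom_log_line(line: str) -> Optional[MomLogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most 5 times so any ';' inside the message text is preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogRecord(*(p.strip() for p in parts))


def records_for_job(lines: Iterable[str], job_id: str) -> Iterator[MomLogRecord]:
    """Yield records whose object name or message mentions job_id."""
    for line in lines:
        rec = parse_mom_log_line(line)
        if rec and (job_id in rec.obj_name or job_id in rec.message):
            yield rec


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python momgrep.py <mom_log_file> <job_id>
    with open(sys.argv[1]) as fh:
        for rec in records_for_job(fh, sys.argv[2]):
            print(rec.timestamp, rec.obj_name, rec.message)
```

Matching on both the object name and the message matters here because
the kill_job record carries the job id only in its message text.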

Before forcibly deleting the job, I verified that the node had no user
processes still running, yet the mom reports "one or more running tasks
found - no obit sent".  Is there somewhere I can see what the mom thinks
is still running?

We are running TORQUE 2.3.3 with Maui.

Thank you,

Tina Declerck
tinad at nersc.gov




