[torqueusers] job can't be killed by mom
Tina Declerck
tinad at nersc.gov
Mon Oct 20 16:01:32 MDT 2008
Recently several jobs could not be killed. In one case I see the
following in one of the mom_logs:
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m;im_request: SIGNAL_TASK 566567.jacin03-m.nersc.gov from node 0 task 3548 signal 9
10/20/2008 12:27:00;0002; pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;im_request: SIGNAL_TASK 566567.jacin03-m from node 0 task 3549 signal 9
10/20/2008 12:27:00;0002; pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'KILL_JOB' for job 566567.jacin03-m from 10.1.60.237:1023
10/20/2008 12:27:00;0008; pbs_mom;Job;kill_job;im_request: sending signal 9, "KILL" to job 566567.jacin03-m.nersc.gov, reason: kill_job message received
10/20/2008 12:27:00;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;one or more running tasks found - no obit sent
10/20/2008 12:27:24;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server jacin03-m
10/20/2008 12:27:24;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server jacin03-m
10/20/2008 12:27:24;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:24;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:24;0008; pbs_mom;Job;do_rpp;got an inter-server request
10/20/2008 12:27:24;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
10/20/2008 12:27:24;0008; pbs_mom;Job;do_rpp;got an inter-server request
10/20/2008 12:27:24;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
10/20/2008 12:27:25;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:25;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to jacin03-m
10/20/2008 12:27:45;0002; pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
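
For picking one job's entries out of a busy mom_log, here is a minimal
sketch. It assumes each entry sits on a single line, as in the log file
itself, and that the entry splits on ';' into six fields in the order
seen in the excerpt. The field names (timestamp, event, daemon,
obj_class, obj_name, message) and the helper names are my own labels,
not anything defined by TORQUE.

```python
from typing import Iterable, List, NamedTuple, Optional


class MomLogEntry(NamedTuple):
    # Field names are illustrative labels, not TORQUE terminology.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_mom_log_line(line: str) -> Optional[MomLogEntry]:
    """Split one log line into six ';'-separated fields, or return None."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*(p.strip() for p in parts))


def entries_for_job(lines: Iterable[str], job_id: str) -> List[MomLogEntry]:
    """Return parsed entries that mention job_id in the name or the message."""
    found = []
    for line in lines:
        entry = parse_mom_log_line(line)
        if entry and (job_id in entry.obj_name or job_id in entry.message):
            found.append(entry)
    return found


# Three entries copied from the excerpt above, one per line.
SAMPLE = [
    "10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;"
    "one or more running tasks found - no obit sent",
    "10/20/2008 12:27:24;0002; pbs_mom;n/a;mom_server_update_stat;"
    "status update successfully sent to jacin03-m",
    "10/20/2008 12:27:00;0008; pbs_mom;Job;kill_job;im_request: sending "
    'signal 9, "KILL" to job 566567.jacin03-m.nersc.gov, reason: kill_job '
    "message received",
]

if __name__ == "__main__":
    for e in entries_for_job(SAMPLE, "566567"):
        print(e.timestamp, e.obj_name, e.message, sep=" | ")
```

The match is on both the name and the message field because some
entries (the kill_job one, for example) carry the job id only in the
message text.
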
Before forcibly deleting the job, I verified that the node had no user
processes still running. Is there somewhere I can see what the mom
thinks is still running?

We are running TORQUE 2.3.3 with Maui.

Thank you,
Tina Declerck
tinad at nersc.gov