[torqueusers] RE: job can't be killed by mom
Tina Declerck
tinad at nersc.gov
Wed Oct 29 13:01:39 MDT 2008
Here is some additional data.
I see the same type of output: the mom reports that one or more
running tasks were found.
Here is what the momctl reports:
job[567126.jacin03-m.nersc.gov] state=EXITING sidlist=8183
However, there is no process with a PID of 8183:
ps -elf | grep 8183
4 S root 6976 6399 0 77 0 - 644 pipe_w 11:46 pts/0 00:00:00 grep 8183
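Note that ps -elf prints PID and PPID columns but no session ID column, so the grep above can only match 8183 as a PID (or by chance in another field). If the values in sidlist are session IDs rather than PIDs, which is an assumption here and not something the momctl output confirms, a process could still belong to session 8183 under a different PID. A minimal sketch of a check by session ID, assuming a Linux /proc filesystem (the helper name pids_in_session is hypothetical):

```python
import os


def pids_in_session(sid, proc_root="/proc"):
    """Return the PIDs whose session ID equals sid, read from <proc_root>/<pid>/stat."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name (field 2) is wrapped in parentheses and may itself
        # contain spaces or ')', so split only what follows the last ')'.
        rest = stat[stat.rfind(")") + 2:].split()
        # rest[0] = state, rest[1] = ppid, rest[2] = pgrp, rest[3] = session
        if len(rest) > 3 and int(rest[3]) == sid:
            matches.append(int(entry))
    return sorted(matches)
```

Calling pids_in_session(8183) on the compute node would list any remaining members of that session. An empty list would suggest the session really is gone and that the mom's view of the job is stale.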
Where does the mom look for active processes?
Thank you for any assistance,
Tina Declerck
tinad at nersc.gov
> Recently several jobs have not been able to be killed. In one case I
> see the following in one of the mom_logs:
>
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m;im_request: SIGNAL_TASK 566567.jacin03-m.nersc.gov from node 0 task 3548 signal 9
> 10/20/2008 12:27:00;0002; pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'SIGNAL_TASK' for job 566567.jacin03-m from 10.1.60.237:1023
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;im_request: SIGNAL_TASK 566567.jacin03-m from node 0 task 3549 signal 9
> 10/20/2008 12:27:00;0002; pbs_mom;Svr;im_request;connect from 10.1.60.237:1023
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;received request 'KILL_JOB' for job 566567.jacin03-m from 10.1.60.237:1023
> 10/20/2008 12:27:00;0008; pbs_mom;Job;kill_job;im_request: sending signal 9, "KILL" to job 566567.jacin03-m.nersc.gov, reason: kill_job message received
> 10/20/2008 12:27:00;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
> 10/20/2008 12:27:00;0008; pbs_mom;Job;566567.jacin03-m.nersc.gov;one or more running tasks found - no obit sent
> 10/20/2008 12:27:24;0002; pbs_mom;n/
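
To find every job affected in a larger mom_log, the records can be split on their semicolons. The sketch below assumes each complete record has six fields, as the entries quoted above appear to (timestamp, event code, daemon, object class, object name, message). The field labels and function names are illustrative, not taken from the TORQUE documentation.

```python
def parse_mom_log_line(line):
    """Split one mom_log record into six fields; return None if it is incomplete."""
    # Drop any mail quoting ("> ") and surrounding whitespace first.
    parts = line.lstrip("> ").strip().split(";", 5)
    if len(parts) != 6:
        return None  # truncated or wrapped fragment, not a full record
    keys = ("timestamp", "event", "daemon", "class", "object", "message")
    return dict(zip(keys, (p.strip() for p in parts)))


def jobs_without_obit(lines):
    """Return job names, in first-seen order, that logged 'no obit sent'."""
    seen = []
    for line in lines:
        rec = parse_mom_log_line(line)
        if rec is None or rec["class"] != "Job":
            continue
        if "no obit sent" in rec["message"] and rec["object"] not in seen:
            seen.append(rec["object"])
    return seen
```

Feeding the lines of a mom_log file to jobs_without_obit would list the jobs for which the mom still believes tasks are running. Each one can then be compared against the sidlist that momctl reports.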