[torqueusers] Torque not deleting job
Adam Emerich
aemerich at us.ibm.com
Fri Apr 20 10:14:40 MDT 2007
Answers:
1. root 2015 1 0 08:54 ? 00:00:02 /usr/local/sbin/pbs_mom
-> by default pbs_mom is not started with "-r" on our system
2. There is no entry in the server log for a failed epilogue or even a
message that says the job is being terminated (note jobid is now 1160 as I
had to recreate the issue to get more details). The first failure in the
log is due to another process being run that was eventually preempted by
job 1160:
04/20/2007 08:46:16;0008;PBS_Server;Job;1160.rrmaster;Job Queued at request of aemerich at rrmaster, owner = aemerich at rrmaster, job name = STDIN, queue = dque
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;could not locate requested resources 'n01-01-06' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
04/20/2007 08:46:26;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Run at request of root at rrmaster
04/20/2007 08:47:38;0008;PBS_Server;Job;1160.rrmaster;Job Modified at request of root at rrmaster
3. "qsig -s 0 1160" did not terminate the job from the server's point of
view.
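For anyone repeating the log check in answer 2, here is a minimal Python sketch of how the server log could be filtered for one job. It assumes each entry sits on a single line with six semicolon-separated fields, as in the excerpt above. The field names (timestamp, code, daemon, kind, ident, message) are my own labels, and searching the message text for "epilogue" is only a guess at how a rejected epilogue would be recorded.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are illustrative labels, not official Torque terminology.
    timestamp: str
    code: str
    daemon: str
    kind: str
    ident: str
    message: str


def parse_entry(line: str) -> Optional[LogEntry]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogEntry(*parts)


def entries_for_job(lines: Iterable[str], job_id: str) -> List[LogEntry]:
    """Return the parsed entries whose identifier field equals job_id."""
    found = []
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and entry.ident == job_id:
            found.append(entry)
    return found


def mentions(entries: Iterable[LogEntry], keyword: str) -> List[LogEntry]:
    """Return entries whose message contains keyword, ignoring case."""
    needle = keyword.lower()
    return [e for e in entries if needle in e.message.lower()]
```

To use it, open the server log (the path depends on the installation), pass the file object to entries_for_job together with "1160.rrmaster", and then call mentions(result, "epilogue"). An empty list would match what I saw: nothing in the log about a failed epilogue for that job.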
Thanks
Adam Emerich
IBM Corporation - Rochester, MN
Staff Engineer
Office: 030-3 F305
Office: (507) 253-5483
Cell: (507) 358-2999
aemerich at us.ibm.com
"Insanity: doing the same thing over and over again and expecting different
results." -Albert Einstein
Garrick Staples <garrick at clusterresources.com>
Sent by: torqueusers-bounces at supercluster.org
04/20/2007 10:13 AM
To: torqueusers at supercluster.org
cc:
Subject: Re: [torqueusers] Torque not deleting job
On Thu, Apr 19, 2007 at 11:41:59AM -0500, Adam Emerich alleged:
>
> I am seeing a case in which torque does not delete an interactive job if
> the node on which the job is running goes down. Here is what I am doing:
>
> qsub -I -l nodes=n01-01-06:ppn=1 -> successfully returns a prompt
> on the machine requested
>
> Then the node (n01-01-06) is rebooted. After the reboot "top" on n01-01-06
> does not show any jobs being run by my userid. However, "showq" shows the
> following on the torque server:
Is pbs_mom being started with the -r option at boot?
Can you check in server_log to see if an epilogue came and was rejected?
Does 'qsig -s 0 1131' cause the job to exit?
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers