[torqueusers] Torque no longer spots dead processes
Chris Samuel
csamuel at vpac.org
Sun May 4 23:41:27 MDT 2008
Hi all,
We've had a hardware guy pull the power on a node with
running jobs, and now find that Torque will not recognise
that the dead jobs are no longer there.
The mom logs lots of:
05/05/2008 13:37:38;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at tango-m.vpac.org
but the pbs_server doesn't appear to realise that the
job is no longer there.
Not even the old trick of sending the job signal 0 to
exercise the signal handler works (because the mom already
knows the job doesn't exist); the pbs_server just reports:
05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;Execution server rejected request
05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;Execution server rejected request
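To see which jobs the server is still being rejected on, a minimal Python sketch (not part of Torque) can count these records per job ID. It assumes the semicolon-separated record layout visible in the log lines quoted here (timestamp, event code, daemon, class, object, message); the function name rejected_jobs and the sample list are hypothetical, for illustration only.

```python
from collections import Counter

# Message text taken from the pbs_server log lines quoted in this post.
REJECT_MSG = "Execution server rejected request"


def rejected_jobs(lines):
    """Count 'Execution server rejected request' records per job ID.

    Assumes each record has six semicolon-separated fields:
    timestamp;event code;daemon;class;object;message
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # not a record in the assumed layout
        _stamp, _event, daemon, _cls, obj, message = fields
        if daemon.strip() == "PBS_Server" and message.strip() == REJECT_MSG:
            counts[obj.strip()] += 1
    return counts


# Sample input: the log lines quoted in this post.
sample = [
    "05/05/2008 13:37:38;0080; pbs_mom;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id), aux=0, type=StatusJob, "
    "from PBS_Server at tango-m.vpac.org",
    "05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;"
    "Execution server rejected request",
    "05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;"
    "Execution server rejected request",
]

for job_id, n in rejected_jobs(sample).items():
    print(job_id, n)
```

The mom's own record is skipped because its daemon field is pbs_mom, so only the server-side rejections are counted.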
Both server and mom are 2.3.1-snap.200804211148.
To me this sounds more like a server-side bug.
So aside from a qdel command (which I'm going to have
to do now to free up the node for a test job), any other
clues?
cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency