[torqueusers] Torque no longer spots dead processes

Chris Samuel csamuel at vpac.org
Sun May 4 23:41:27 MDT 2008


Hi all,

We've had a hardware guy pull the power on a node with
running jobs, and now find that Torque will not recognise
that the dead jobs are no longer there.

The mom logs lots of:

05/05/2008 13:37:38;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at tango-m.vpac.org

but the pbs_server doesn't appear to realise that the
job is no longer there.
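
To gauge how often the mom is rejecting these status requests, the log lines can be tallied. Below is a minimal sketch, assuming log lines are semicolon-separated into six fields as in the excerpt above. The field names (timestamp, event, daemon, obj_class, obj_name, message) and the helper names parse_line and count_rejects are made up for illustration and are not part of Torque.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative guesses based on the log excerpt.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six semicolon-separated fields."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # not in the expected format
    return LogRecord(*(p.strip() for p in parts))


# Matches e.g. "Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob"
REJECT_RE = re.compile(r"Reject reply code=(\d+)\(([^)]*)\).*?type=(\w+)")


def count_rejects(lines: Iterable[str]) -> Counter:
    """Count reject replies keyed by (code, reason, request type)."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = REJECT_RE.search(rec.message)
        if m:
            counts[(int(m.group(1)), m.group(2), m.group(3))] += 1
    return counts
```

Feeding the lines of a mom log file to count_rejects gives a count per (code, reason, request type), which makes it easy to see whether the rejections are all StatusJob requests for unknown job ids.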

Not even the old trick of sending the job signal 0 to
exercise the signal handler works (because the mom already
knows the job doesn't exist); the pbs_server just reports:

05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;Execution server rejected request
05/05/2008 15:27:40;0080;PBS_Server;Req;321507.tango-m.vpac.org;Execution server rejected request
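
For reference, the signal 0 trick can be scripted. This is only a sketch, and it assumes the signal is sent with qsig and its -s option; the post does not say which command was actually used. The function names are hypothetical.

```python
import subprocess
from typing import List


def qsig_null_signal_cmd(job_id: str) -> List[str]:
    """Build the argv for sending signal 0 to a job.

    Assumption: the trick is done with `qsig -s 0 <jobid>`.
    """
    return ["qsig", "-s", "0", job_id]


def send_null_signal(job_id: str) -> int:
    """Run qsig and return its exit status (requires the Torque client tools)."""
    result = subprocess.run(
        qsig_null_signal_cmd(job_id),
        capture_output=True,
        text=True,
    )
    return result.returncode
```

In the situation described here a call such as send_null_signal("321507.tango-m.vpac.org") would be expected to fail, since the mom rejects the request for a job it no longer knows about.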

Both server and mom are 2.3.1-snap.200804211148.

To me this sounds more like a server-side bug.

So aside from a qdel command (which I'm going to have
to do now to free up the node for a test job), any other
clues?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

