[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Thu May 22 16:42:29 MDT 2008


Hello All,

	Consider a cluster where a node that is running a job reboots, killing
both the job and pbs_mom. At this point pbs_server (via pbsnodes) still
reports the node as "job-exclusive", which is fine for now.

	The node boots up and pbs_mom starts cleanly. Shortly afterwards, the
pbs_mom on the node receives a message from the server requesting the
status of the job:

05/22/2008 14:49:11;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10
05/22/2008 14:49:11;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at master

Here, pbs_mom correctly reports that this JobID is unknown. I would
expect pbs_server to act on this information: mark the job as failed,
dequeue it, and mark the node as available again. Instead, nothing
seems to happen.
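
For anyone who wants to check their own pbs_mom logs for the same symptom, here is a minimal Python sketch that scans log lines for the reject shown above. It assumes the ';'-separated layout seen in the excerpt; the field labels (event, obj_class, obj_name) and the function names are my own and are not taken from TORQUE. Note that the reject line does not carry the job ID, so this only tells you that a StatusJob request was rejected, not which job it was for.

```python
import re

# Pattern for the reject message as it appears in the log excerpt above.
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\(([^)]*)\), aux=(-?\d+), type=(\w+)"
)

# Reject code reported for "Unknown Job Id" in the excerpt above.
UNKNOWN_JOB_ID = 15001


def parse_mom_log_line(line):
    """Split one log line into six ';'-separated fields.

    The field labels are hypothetical; only the layout is taken from
    the excerpt. Returns None if the line does not have six fields.
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, event, daemon, obj_class, obj_name, message = parts
    return {
        "timestamp": timestamp,
        "event": event,
        "daemon": daemon.strip(),
        "obj_class": obj_class,
        "obj_name": obj_name,
        "message": message,
    }


def find_unknown_job_rejects(lines):
    """Return parsed records for StatusJob requests rejected with 15001."""
    hits = []
    for line in lines:
        record = parse_mom_log_line(line)
        if record is None:
            continue
        match = REJECT_RE.search(record["message"])
        if (
            match
            and int(match.group(1)) == UNKNOWN_JOB_ID
            and match.group(4) == "StatusJob"
        ):
            hits.append(record)
    return hits


# The two log lines from this post, with the mail client's wrapping undone.
SAMPLE = [
    "05/22/2008 14:49:11;0100;   pbs_mom;Req;;Type StatusJob request "
    "received from PBS_Server at master, sock=10",
    "05/22/2008 14:49:11;0080;   pbs_mom;Req;req_reject;Reject reply "
    "code=15001(Unknown Job Id), aux=0, type=StatusJob, "
    "from PBS_Server at master",
]

if __name__ == "__main__":
    for record in find_unknown_job_rejects(SAMPLE):
        print(record["timestamp"], record["message"])
```

To run it against a real log, pass the open file object instead of SAMPLE; the location of the mom log depends on the installation, so I have left the path out.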

What am I missing here? If a node reboots and the job on it dies, why
doesn't pbs_server detect that and report the job as failed?

I'm running TORQUE 2.1.9.

-Josh

