[torqueusers] Unknown Job Id Behavior
jbernstein at penguincomputing.com
Thu May 22 16:42:29 MDT 2008
Consider a cluster where a node is running a job and the node reboots,
killing both the job and pbs_mom. At this point pbs_server (via
pbsnodes) still reports the node as "job-exclusive", which is fine for
now.
The node boots up and pbs_mom starts cleanly. Shortly afterwards, the
pbs_mom on the node receives a message from the server requesting the
status of the job:
05/22/2008 14:49:11;0100; pbs_mom;Req;;Type StatusJob request received
from PBS_Server at master, sock=10
05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply
code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at master
Here, pbs_mom correctly reports that this job ID is unknown. I would
expect pbs_server to act on this information: mark the job as failed,
dequeue it, and mark the node as up again. But nothing seems to happen.
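For anyone who wants to spot these rejections in the pbs_mom log
automatically, here is a minimal Python sketch. It assumes the log
entries look exactly like the excerpt above. The helper name
parse_reject is hypothetical and not part of TORQUE; it only parses
text and does not talk to pbs_server or pbs_mom.

```python
import re

# Pattern for the "Reject reply" message text seen in the pbs_mom log excerpt.
# Assumption: other TORQUE versions may format this message differently.
REJECT_RE = re.compile(
    r"Reject reply\s+code=(?P<code>\d+)\((?P<reason>[^)]*)\),\s*"
    r"aux=(?P<aux>-?\d+),\s*type=(?P<type>\w+)"
)


def parse_reject(entry):
    """Return the reject details from one log entry, or None if it is not a reject.

    parse_reject is a hypothetical helper written for this example.
    """
    # Entries may be wrapped across lines (as in the mail), so collapse whitespace.
    text = " ".join(entry.split())
    match = REJECT_RE.search(text)
    if match is None:
        return None
    return {
        "code": int(match.group("code")),
        "reason": match.group("reason"),
        "aux": int(match.group("aux")),
        "type": match.group("type"),
    }


# The reject entry from the excerpt, including the line wrap.
sample = (
    "05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply\n"
    "code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at master"
)

print(parse_reject(sample))
```

A script like this could flag StatusJob requests rejected with code
15001, but it only detects the condition; it does not change what
pbs_server does with the job or the node.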
What am I missing here? If a node reboots and the job on it dies, why
doesn't pbs_server recognize that and report the job as failed?
I'm running TORQUE 2.1.9.