[torqueusers] Unknown Job Id Behavior
Garrick Staples
garrick at usc.edu
Thu May 22 17:13:13 MDT 2008
On Thu, May 22, 2008 at 03:42:29PM -0700, Joshua Bernstein alleged:
> Hello All,
>
> Consider a cluster where a node has a job that is running on it, and
> the node reboots, killing the job, as well as pbs_mom. At this point
> pbs_server (via pbsnodes) still reports that the node is marked as
> "job-exclusive", which is fine for now.
>
> The node boots up and pbs_mom starts cleanly. Shortly after the
> pbs_mom on the node receives a message from the server requesting the
> status of the job:
>
> 05/22/2008 14:49:11;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10
> 05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at master
>
> Here, pbs_mom correctly reports that this JobID is unknown. I would
> expect pbs_server to act on this information: mark the job as failed,
> dequeue it, and mark the node as back up. But nothing seems to happen.
>
> What am I missing here? If a node reboots and the job on it dies,
> why doesn't pbs_server recognize that and report the job as failed?
>
> I'm running TORQUE 2.1.9
That's not the correct behavior. Upon boot, pbs_mom should recover the job
state and send job exit notices to pbs_server.

Are you accidentally starting pbs_mom with an -r or -p option? The pbs_mom
manpage details the options (a normal boot should have no options).
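
For anyone scanning mom logs for this condition, here is a minimal Python
sketch (not part of TORQUE) that picks out the "Unknown Job Id" rejections.
The semicolon-separated field layout is inferred only from the two log lines
quoted above, and the names Reject, parse_reject and unknown_job_rejects are
made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches the message part of a reject line as it appears in the quoted log,
# e.g. "Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, ..."
REJECT_RE = re.compile(r"Reject reply code=(\d+)\(([^)]*)\).*?type=(\w+)")

UNKNOWN_JOB_ID = 15001  # numeric code shown in the quoted log line


class Reject(NamedTuple):
    timestamp: str
    code: int
    reason: str
    request_type: str


def parse_reject(line: str) -> Optional[Reject]:
    """Return a Reject for a reject-reply log line, or None for anything else.

    Assumes six semicolon-separated fields with the message last, which is
    what the two sample lines look like; other log formats are not handled.
    """
    fields = line.strip().split(";", 5)
    if len(fields) != 6:
        return None
    match = REJECT_RE.search(fields[5])
    if match is None:
        return None
    return Reject(fields[0], int(match.group(1)), match.group(2), match.group(3))


def unknown_job_rejects(lines: Iterable[str]) -> List[Reject]:
    """Keep only the rejections whose code is 15001 (Unknown Job Id)."""
    parsed = (parse_reject(line) for line in lines)
    return [r for r in parsed if r is not None and r.code == UNKNOWN_JOB_ID]


if __name__ == "__main__":
    sample = [
        "05/22/2008 14:49:11;0100; pbs_mom;Req;;Type StatusJob request "
        "received from PBS_Server at master, sock=10",
        "05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply "
        "code=15001(Unknown Job Id), aux=0, type=StatusJob, "
        "from PBS_Server at master",
    ]
    for reject in unknown_job_rejects(sample):
        print(reject.timestamp, reject.request_type, reject.reason)
```

A hit right after a node reboot suggests pbs_mom came up without its
previous job state, which is the case the startup options would explain.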
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html