[torqueusers] Unknown Job Id Behavior

Garrick Staples garrick at usc.edu
Thu May 22 17:13:13 MDT 2008


On Thu, May 22, 2008 at 03:42:29PM -0700, Joshua Bernstein alleged:
> Hello All,
> 
> 	Consider a cluster where a node has a job that is running on it, and 
> the node reboots, killing the job, as well as pbs_mom. At this point 
> pbs_server (via pbsnodes) still reports that the node is marked as 
> "job-exclusive", which is fine for now.
> 
> 	The node boots up and pbs_mom starts cleanly. Shortly after, the 
> pbs_mom on the node receives a message from the server requesting the 
> status of the job:
> 
> 05/22/2008 14:49:11;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10
> 05/22/2008 14:49:11;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server at master
> 
> Here, pbs_mom correctly reports that this JobID is unknown. I would 
> figure that pbs_server should do something with this information, and 
> hence mark the job as failed, dequeue, and mark the node as back up, but 
> nothing seems to happen.
> 
> What am I missing here? If a node reboots and the job running on it dies, 
> why doesn't pbs_server understand that and report the job as failed?
> 
> I'm running TORQUE 2.1.9

That's not the expected behavior.  Upon boot, pbs_mom should recover the job
state and send job exit notices to pbs_server.

Are you accidentally starting pbs_mom with an -r or -p option?  The pbs_mom
manpage details the options (a normal boot should have no options).
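
One way to check this, as a minimal sketch only: the helper below takes the
command line pbs_mom was started with and reports any standalone -r or -p
option. The function name and the example paths are hypothetical, not part of
TORQUE, and the check only matches the flags when they are given as separate
arguments (it will not catch bundled forms such as -rp).

```python
import shlex

# Options named in this thread as ones a normal boot should not use.
SUSPECT_FLAGS = ("-r", "-p")


def suspect_flags(cmdline):
    """Return the -r/-p options found in a pbs_mom command line.

    `cmdline` is the full command line as one string. The first token is
    treated as the program name and skipped. Each flag is reported once,
    in the order it first appears.
    """
    args = shlex.split(cmdline)[1:]
    found = []
    for arg in args:
        if arg in SUSPECT_FLAGS and arg not in found:
            found.append(arg)
    return found


# Hypothetical install path, for illustration only.
print(suspect_flags("/usr/sbin/pbs_mom"))      # prints []
print(suspect_flags("/usr/sbin/pbs_mom -r"))   # prints ['-r']
```

On a Linux node with procps, the running command line can usually be obtained
with something like `ps -C pbs_mom -o args=` and passed to the function; an
empty result suggests the init script is not adding either option.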

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html