[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Mon Jun 16 12:41:06 MDT 2008



Glen Beane wrote:
> 
> 
> On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>     Glen Beane wrote:
> 
>         I think I can probably try that out tomorrow, but I would really
>         appreciate it if you could give this a test first.
> 
> 
>     Alright, I just grabbed the SVN tree from about an hour or so ago
>     and gave this a go. At first it seems to do the right thing. After a
>     node reboots and comes back up, I see:
> 
>     06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
>     06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
>     06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
> 
>     The job then disappears from the server's qstat, but pbsnodes n0
>     still shows the job as being on that node. The node then suddenly
>     gets marked as down and reports:
> 
>     06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
>     06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
> 
>     Just let me know how I can help!
> 
> 
> Can you try the latest 2.3-fixes? I had forgotten to release the
> resources used by the unknown job. I just tested this out and pbsnodes
> no longer shows the job as being on the node. After several minutes the
> state of the node is still free.

I'll see if I can give it a build and a shot today.
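
For comparing log output between builds, the pbs_server and pbs_mom lines
quoted above can be split on semicolons. The following is only a minimal
sketch: the six-field layout is inferred from the excerpts in this thread,
and the field names (timestamp, event_code, daemon, obj_class, obj_name,
message) are illustrative labels, not official Torque terminology.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels inferred from the quoted excerpts.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split one semicolon-delimited log line into six fields."""
    # Split on the first five semicolons only, so any semicolon that
    # appears later in the free-text message is left intact.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    # Strip padding such as the leading spaces before "pbs_mom".
    return LogRecord(*(p.strip() for p in parts))


if __name__ == "__main__":
    sample = ("06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;"
              "Premature end of message from addr 10.101.10.25:15001")
    rec = parse_log_line(sample)
    print(rec.daemon, rec.obj_name, rec.message)
```

Filtering the parsed records on the daemon or obj_name field makes it easier
to see whether the im_eof messages still appear after a rebuild.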

-Joshua Bernstein
Software Engineer
Penguin Computing

