[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Tue Jun 17 17:21:22 MDT 2008



Glen Beane wrote:
> 
> 
> On Mon, Jun 16, 2008 at 2:41 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>     Glen Beane wrote:
> 
> 
> 
>         On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein
>         <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>            Glen Beane wrote:
> 
>                I think I can probably try that out tomorrow, but I would
>                really appreciate it if you could give this a test first.
> 
> 
>            Alright, I just grabbed the SVN tree from about an hour or so ago
>            and gave this a go. At first it seems to do the right thing. When
>            a node reboots, and after it comes up, I see:
> 
>            06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
>            06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
>            06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
> 
> 
>            The job then disappears from the server's qstat, but pbsnodes n0
>            still shows the job as being on that node. But the node suddenly
>            gets marked as down and it reports:
> 
>            06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
> 
>            06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
> 
> 
>            Just let me know how I can help!
> 
> 
>         can you try the latest 2.3-fixes?  I had forgotten to release
>         the resources used by the unknown job.  I just tested this out
>         and pbsnodes no longer shows the job as being on the node. After
>         several minutes the state of the node is still free.
> 
> 
>     I'll see if I can give it a build and a shot today.
> 
>  
> If it works out for you, I'll add in the code to requeue jobs that are 
> rerunnable.

So far this seems to be doing the trick. I'm having a bit of an issue 
with the server communicating with the nodes: when upgrading from 2.1.9 
to this branch, there seems to be a need to rebuild the server's 
database before the nodes will communicate. If you have the code 
related to the requeuing ready to check, I'd go ahead and give that a 
whirl. I can build it, run it through our regression suite, and see 
what we end up with.
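
For anyone wanting to watch for the same symptoms while testing, here is 
a minimal sketch in Python that picks the node-trouble entries out of 
log lines like the ones quoted above. The field names (event, daemon, 
obj_type, obj_name) and the helper names parse_line and node_trouble are 
assumptions made for illustration, inferred only from the 
semicolon-separated layout of those quoted lines. This is not part of 
TORQUE or of any regression suite.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are assumptions based on the quoted log lines.
    when: datetime
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one semicolon-delimited log line into six fields.

    Returns None when the line does not have the expected shape.
    """
    # Split at most five times so semicolons inside the message survive.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    # The pbs_mom lines pad the daemon name with spaces, so strip it.
    return LogEntry(when, parts[1], parts[2].strip(),
                    parts[3], parts[4], parts[5])


def node_trouble(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries whose message matches the two symptoms above."""
    keywords = ("unable to contact node", "Premature end of message")
    entries = (parse_line(line) for line in lines)
    return [e for e in entries
            if e is not None and any(k in e.message for k in keywords)]


if __name__ == "__main__":
    sample = [
        "06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;"
        "ALERT: unable to contact node n0",
        "06/12/2008 13:02:48;0040;PBS_Server;Svr;"
        "goldstar.penguincomputing.com;Scheduler sent command term",
        "06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;"
        "Premature end of message from addr 10.101.10.25:15001",
    ]
    for entry in node_trouble(sample):
        print(entry.when, entry.daemon, entry.message)
```

Feeding it the server and mom logs after a node reboot would show 
whether either message still turns up once the fix is in.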

Perhaps after this we could release a 2.3.1 in the next week or so?

-Josh

