[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Tue Jun 17 17:45:25 MDT 2008



Glen Beane wrote:
> 
> 
> On Mon, Jun 16, 2008 at 2:41 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>     Glen Beane wrote:
> 
> 
> 
>         On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein
>         <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>            Glen Beane wrote:
> 
>                I think I can probably try that out tomorrow, but I would really
>                appreciate it if you could give this a test first.
> 
> 
>            Alright, I just grabbed the SVN tree from about an hour or so ago
>            and gave this a go. At first it seems to do the right thing. When a
>            node reboots, and after it comes up, I see:
> 
>            06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
>            06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
>            06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
> 
> 
>            The job then disappears from the server's qstat, but pbsnodes n0
>            still shows the job as being on that node. But the node suddenly
>            gets marked as down and it reports:
> 
>            06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
>            06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
> 
> 
>            Just let me know how I can help!
> 
> 
>         Can you try the latest 2.3-fixes?  I had forgotten to release
>         the resources used by the unknown job.  I just tested this out
>         and pbsnodes no longer shows the job as being on the node. After
>         several minutes the state of the node is still free.
> 
> 
>     I'll see if I can give it a build and a shot today.
> 
>  
> If it works out for you, I'll add in the code to requeue jobs that are 
> rerunnable.

Yup, it seems I'm getting some issue with the pbs_moms connecting back
to pbs_server. I don't understand it, though. Perhaps there was some other
change in 2.3-fixes that I'm picking up that is causing these issues.

06/17/2008 15:54:30;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
06/17/2008 15:55:49;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
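
For anyone who wants to tally these across a longer mom log, here is a
minimal Python sketch. It assumes only the semicolon-separated layout
visible in the lines above (timestamp, event code, daemon, class, object,
message). The function name count_premature_eof and the regex group names
are made up for illustration and are not part of TORQUE.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines quoted in this thread:
# MM/DD/YYYY HH:MM:SS;code;daemon;class;object;message
LOG_LINE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r"(?P<code>[0-9a-fA-F]{4});\s*(?P<daemon>[^;]+);"
    r"(?P<cls>[^;]*);(?P<obj>[^;]*);(?P<msg>.*)$"
)
PEER = re.compile(r"Premature end of message from addr (\S+)")


def count_premature_eof(lines):
    """Count im_eof 'Premature end of message' entries per peer address."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        # Skip lines that do not parse or are not im_eof records.
        if not m or m.group("obj") != "im_eof":
            continue
        peer = PEER.search(m.group("msg"))
        if peer:
            counts[peer.group(1)] += 1
    return counts


sample = [
    "06/17/2008 15:54:30;0002;   pbs_mom;Svr;im_eof;"
    "Premature end of message from addr 10.2.1.1:15001",
    "06/17/2008 15:55:49;0002;   pbs_mom;Svr;im_eof;"
    "Premature end of message from addr 10.101.10.25:15001",
]
for addr, n in sorted(count_premature_eof(sample).items()):
    print(addr, n)
```

Feeding it the lines of a real mom log instead of the sample list should
show whether the errors come from one peer or from several.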

-Josh

