[torqueusers] Unknown Job Id Behavior
Joshua Bernstein
jbernstein at penguincomputing.com
Tue Jun 17 17:45:25 MDT 2008
Glen Beane wrote:
>
>
> On Mon, Jun 16, 2008 at 2:41 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>
>
>
> Glen Beane wrote:
>
>
>
> On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>
>
>
> Glen Beane wrote:
>
> I think I can probably try that out tomorrow, but I would really
> appreciate it if you could give this a test first.
>
>
> Alright, I just grabbed the SVN tree from about an hour or so ago
> and gave this a go. At first it seems to do the right thing. When a
> node reboots, and after it comes up I see:
>
> 06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to
> contact node n0
> 06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing
> from batch, state EXITING
> 06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler
> sent command term
>
>
> The job then disappears from the server's qstat, but pbsnodes n0
> still shows the job as being on that node. But the node suddenly
> gets marked as down and it reports:
>
> 06/12/2008 13:08:58;0002; pbs_mom;Svr;im_eof;Premature end of
> message from addr 10.101.10.25:15001
>
> 06/12/2008 13:09:14;0002; pbs_mom;Svr;im_eof;Premature end of
> message from addr 10.2.1.1:15001
>
>
> Just let me know how I can help!
>
>
> Can you try the latest 2.3-fixes? I had forgotten to release
> the resources used by the unknown job. I just tested this out
> and pbsnodes no longer shows the job as being on the node. After
> several minutes the state of the node is still free.
>
>
> I'll see if I can give it a build and a shot today.
>
>
> If it works out for you, I'll add in the code to requeue jobs that are
> rerunnable.
Yup, it seems I'm getting some issues with the pbs_moms connecting back
to pbs_server. I don't understand it, though; perhaps there was another
change in 2.3-fixes that I'm picking up that is causing these issues.
06/17/2008 15:54:30;0002; pbs_mom;Svr;im_eof;Premature end of message
from addr 10.2.1.1:15001
06/17/2008 15:55:49;0002; pbs_mom;Svr;im_eof;Premature end of message
from addr 10.101.10.25:15001
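
For anyone wanting to see which peers these im_eof messages point at, here is a minimal Python sketch that tallies them per address. It assumes each log entry is a single unwrapped line with six semicolon-separated fields (timestamp, event code, daemon, class, name, message), which is inferred only from the excerpts quoted in this thread. The function names parse_line and count_im_eof are made up for illustration and are not part of TORQUE.

```python
import re
from collections import Counter

# Matches the "from addr IP:PORT" tail seen in the im_eof messages above.
ADDR_RE = re.compile(r"from addr (\d+\.\d+\.\d+\.\d+):(\d+)")


def parse_line(line):
    """Split one log line into fields; return None if it does not fit.

    Assumed layout (inferred from the quoted excerpts, not from the
    TORQUE documentation): timestamp;code;daemon;class;name;message
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    stamp, code, daemon, cls, name, message = (p.strip() for p in parts)
    return {
        "timestamp": stamp,
        "code": code,
        "daemon": daemon,
        "class": cls,
        "name": name,
        "message": message,
    }


def count_im_eof(lines):
    """Count im_eof entries per peer IP address."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["name"] != "im_eof":
            continue
        match = ADDR_RE.search(rec["message"])
        if match:
            counts[match.group(1)] += 1
    return counts


# The two entries from this message, joined back onto single lines.
sample = [
    "06/17/2008 15:54:30;0002; pbs_mom;Svr;im_eof;"
    "Premature end of message from addr 10.2.1.1:15001",
    "06/17/2008 15:55:49;0002; pbs_mom;Svr;im_eof;"
    "Premature end of message from addr 10.101.10.25:15001",
]

if __name__ == "__main__":
    for addr, n in count_im_eof(sample).items():
        print(addr, n)
```

Run against a real mom log, the same loop could read lines from an open file instead of the sample list, provided wrapped entries are joined first.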
-Josh