[torqueusers] Unknown Job Id Behavior
Joshua Bernstein
jbernstein at penguincomputing.com
Thu Jun 12 15:03:34 MDT 2008
Glen Beane wrote:
> I think I can probably try that out tomorrow, but I would
> really appreciate it if you could give this a test first.
Alright, I grabbed the SVN tree from about an hour ago and gave this a
go. At first it seems to do the right thing. After a node reboots and
comes back up, I see:
06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
The job then disappears from the server's qstat output, but pbsnodes n0
still shows the job as being on that node. The node then suddenly gets
marked as down, and it reports:
06/12/2008 13:08:58;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
06/12/2008 13:09:14;0002; pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
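
In case it helps anyone line up the server and mom entries by time, here
is a minimal Python sketch that splits records like the ones above into
fields. It assumes each record has been put on a single line and has six
semicolon-separated fields. The field names (event, daemon, obj_class,
obj_name) and reading the second field as a hexadecimal code are my own
assumptions for illustration, not anything taken from the TORQUE source.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative guesses, not official TORQUE terms.
    timestamp: datetime
    event: int
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> LogRecord:
    """Split one single-line log record into its six fields."""
    # Split at most 5 times so semicolons inside the message survive.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    stamp, event, daemon, obj_class, obj_name, message = parts
    return LogRecord(
        timestamp=datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S"),
        event=int(event, 16),   # assumption: the code is hexadecimal
        daemon=daemon.strip(),  # the mom log pads this field with a space
        obj_class=obj_class.strip(),
        obj_name=obj_name.strip(),
        message=message.strip(),
    )


if __name__ == "__main__":
    sample = ("06/12/2008 13:08:58;0002; pbs_mom;Svr;im_eof;"
              "Premature end of message from addr 10.101.10.25:15001")
    rec = parse_record(sample)
    print(rec.timestamp.isoformat(), rec.daemon, rec.message)
```

Sorting the parsed records from both logs by timestamp makes the gap
between the dequeue at 13:02:48 and the first im_eof at 13:08:58 easy to
see.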
Just let me know how I can help!
-Joshua Bernstein
Software Engineer
Penguin Computing