[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Thu Jun 12 15:03:34 MDT 2008



Glen Beane wrote:

> I think I can probably try that out tomorrow, but I would
> really appreciate it if you could give this a test first.

Alright, I just grabbed the SVN tree from about an hour or so ago and
gave this a go. At first it seems to do the right thing. After a node
reboots and comes back up, I see:

06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term

The job then disappears from the server's qstat output, but pbsnodes n0
still shows the job as being on that node. A few minutes later the node
suddenly gets marked as down, and the pbs_mom log reports:

06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001

Just let me know how I can help!

-Joshua Bernstein
Software Engineer
Penguin Computing

