[torqueusers] Unknown Job Id Behavior

Glen Beane glen.beane at gmail.com
Thu Jun 12 18:53:40 MDT 2008


On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein <
jbernstein at penguincomputing.com> wrote:

> Glen Beane wrote:
>
>> I think I can probably try that out tomorrow, but I would really
>> appreciate it if you could give this a test first.
>
> Alright, I just grabbed the SVN tree from about an hour ago and gave
> this a go. At first it seems to do the right thing. When a node reboots,
> after it comes back up I see:
>
> 06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
> 06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
> 06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
>
> The job then disappears from the server's qstat, but pbsnodes n0 still
> shows the job as being on that node. Then the node suddenly gets marked as
> down, and pbs_mom reports:
>
> 06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
> 06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
>
> Just let me know how I can help!


Can you try the latest 2.3-fixes?  I had forgotten to release the resources
used by the unknown job.  I just tested this out, and pbsnodes no longer
shows the job as being on the node. After several minutes the state of the
node is still free.
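
If it helps when comparing logs before and after the fix, here is a rough
Python sketch for splitting records like the ones quoted above into fields.
The field names (when, event, daemon, obj_class, obj_name, message) are only
my reading of the six semicolon-separated columns in those lines, and
parse_record is a made-up helper, not anything that ships with TORQUE.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are assumptions based on the six columns seen in the
    # quoted log lines; they are not taken from TORQUE itself.
    when: datetime
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> LogRecord:
    """Split one unwrapped log line into its six fields (hypothetical helper)."""
    # Split on the first five semicolons only, so the message keeps any
    # semicolons of its own.
    stamp, event, daemon, obj_class, obj_name, message = line.strip().split(";", 5)
    return LogRecord(
        when=datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        event=event,
        daemon=daemon.strip(),  # pbs_mom lines pad this column with spaces
        obj_class=obj_class,
        obj_name=obj_name,
        message=message,
    )


if __name__ == "__main__":
    sample = ("06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;"
              "Premature end of message from addr 10.101.10.25:15001")
    rec = parse_record(sample)
    print(rec.daemon, rec.obj_name, rec.message)
```

Filtering on obj_name == "im_eof" would then pick out just the
"Premature end of message" records from a mom log, assuming each record has
been joined back onto a single line first.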