[torqueusers] Unknown Job Id Behavior

Glen Beane glen.beane at gmail.com
Mon Jun 16 15:32:25 MDT 2008


On Mon, Jun 16, 2008 at 2:41 PM, Joshua Bernstein <
jbernstein at penguincomputing.com> wrote:

>
>
> Glen Beane wrote:
>
>>
>>
>> On Thu, Jun 12, 2008 at 5:03 PM, Joshua Bernstein <
>> jbernstein at penguincomputing.com> wrote:
>>
>>
>>
>>    Glen Beane wrote:
>>
>>        I think I can probably try that out tomorrow, but I would really
>>        appreciate it if you could give this a test first.
>>
>>
>>    Alright, I just grabbed the SVN tree from about an hour ago and gave
>>    this a go. At first it seems to do the right thing. When a node
>>    reboots and comes back up, I see:
>>
>>    06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0
>>    06/12/2008 13:02:48;0100;PBS_Server;Job;0.goldstar.penguincomputing.com;dequeuing from batch, state EXITING
>>    06/12/2008 13:02:48;0040;PBS_Server;Svr;goldstar.penguincomputing.com;Scheduler sent command term
>>
>>    The job then disappears from the server's qstat, but pbsnodes n0
>>    still shows the job as being on that node. The node then suddenly
>>    gets marked as down, and pbs_mom reports:
>>
>>    06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001
>>    06/12/2008 13:09:14;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.2.1.1:15001
>>
>>    Just let me know how I can help!
>>
>>
>> Can you try the latest 2.3-fixes?  I had forgotten to release the
>> resources used by the unknown job.  I just tested this out, and pbsnodes no
>> longer shows the job as being on the node. After several minutes the state
>> of the node is still free.
>>
>
> I'll see if I can give it a build and a shot today.
>

If it works out for you, I'll add in the code to requeue jobs that are
rerunnable.
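
For anyone comparing server and mom logs while testing this, the entries
quoted earlier in the thread can be split into fields with a few lines of
Python. This is only a minimal sketch: it assumes the semicolon-separated
layout visible in those quoted lines (timestamp;event code;daemon;class;
object;message), and the field names and the parse_line / from_daemon
helpers are illustrative labels, not TORQUE identifiers.

```python
from typing import List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are assumptions chosen for readability.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into six fields; return None if it does not fit."""
    # Split at most 5 times so any ';' inside the message text is preserved.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    # Some entries pad the daemon name with spaces ("   pbs_mom"), so strip.
    return LogEntry(*(p.strip() for p in parts))


def from_daemon(lines: List[str], daemon: str) -> List[LogEntry]:
    """Return the parsed entries written by the given daemon."""
    entries = (parse_line(l) for l in lines)
    return [e for e in entries if e is not None and e.daemon == daemon]


# Sample lines copied from the messages quoted above.
SAMPLE = [
    "06/12/2008 13:01:23;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact node n0",
    "06/12/2008 13:08:58;0002;   pbs_mom;Svr;im_eof;Premature end of message from addr 10.101.10.25:15001",
]

if __name__ == "__main__":
    for entry in from_daemon(SAMPLE, "pbs_mom"):
        print(entry.timestamp, entry.obj_name, entry.message)
```

Filtering on the daemon field makes it easier to line up what pbs_server
logged against what pbs_mom logged around the time the node was marked down.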