[torquedev] [torqueusers] pbs_mom crashing

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 17 17:35:21 MDT 2009



Glen Beane wrote:
> On Wed, Jun 17, 2009 at 6:53 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>> Glen Beane wrote:
>>> I had two nodes do this yesterday; in both cases the last thing in
>>> the log file is an "obit sent to server" message. These nodes had
>>> been running some single-processor jobs that took about 15 hours
>>> each.  Probably at most each node had run about a dozen of these jobs
>>> (4 at a time) out of 1,000 or so that were submitted to the cluster.
>>> I'm not sure how long it had been since pbs_mom on these two nodes had
>>> been restarted.
>> Did this happen after the fix I suggested? Have you been able to at least
>> compare the backtraces?
> 
> 
> I haven't had a chance to try out any fixes yet, and I don't have
> backtraces. We've enabled core dumps on some of the nodes, but it will
> probably take a while for us to reproduce the problem.

Let me know whatever I can do to help.

-Josh

