[torquedev] [torqueusers] pbs_mom crashing
glen.beane at gmail.com
Wed Jun 17 17:23:45 MDT 2009
On Wed, Jun 17, 2009 at 6:53 PM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
> Glen Beane wrote:
>> I had two nodes do this yesterday; in both cases the last thing in
>> the log file is an "obit sent to server" message. These nodes had
>> been running some single-processor jobs that took about 15 hours
>> each. Probably at most each node had run about a dozen of these jobs
>> (4 at a time) out of 1,000 or so that were submitted to the cluster.
>> I'm not sure how long it had been since pbs_mom on these two nodes had
>> been restarted.
> Did this happen after the fix I suggested? Have you been able to at least
> compare the backtraces?
I haven't had a chance to try out any fixes yet, and I don't have
backtraces. We've enabled core dumps on some of the nodes but it will
probably take a while for us to reproduce the problem.