[torqueusers] pbs_mom crashing

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 17 16:53:41 MDT 2009


Glen Beane wrote:
> I had two nodes do this yesterday. In both cases the last thing in
> the log file is an "obit sent to server" message.  These nodes had
> been running some single-processor jobs that took about 15 hours
> each.  Probably at most each node had run about a dozen of these jobs
> (4 at a time) out of 1,000 or so that were submitted to the cluster.
> I'm not sure how long it had been since pbs_mom on these two nodes had
> been restarted.

Did this happen after the fix I suggested? Have you been able to at least 
compare the backtraces?

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

