[torqueusers] stability again
saydakov at yahoo-inc.com
Fri Sep 29 15:58:24 MDT 2006
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of Garrick Staples
> Sent: Friday, September 29, 2006 1:22 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] stability again
> On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> > Hi!
> > I would like to report another incident in which rebooting a few nodes
> > resulted in a server crash. Those nodes became unresponsive because of
> > another problem, not related to Torque in any way; they were put offline
> > and rebooted. This is not the first time that losing nodes has made the
> > server unresponsive or led to a core dump.
> > Core was generated by `pbs_server'.
> > Program terminated with signal 11, Segmentation fault.
> > Reading symbols from /usr/lib/libkvm.so.2...done.
> > Reading symbols from /usr/lib/libc.so.4...done.
> > Reading symbols from /usr/libexec/ld-elf.so.1...done.
> > #0 0x1038636 in get_next ()
> > (gdb) bt
> > #0 0x1038636 in get_next ()
> > #1 0x1012a7f in remove_job_delete_nanny ()
> > #2 0x1013e5c in on_job_exit ()
> > #3 0x1028c24 in dispatch_task ()
> > #4 0x10042e7 in process_Dreply ()
> > #5 0x1039f3d in wait_request ()
> > #6 0x100f9c3 in main ()
> > #7 0x1001fa6 in _start ()
> > We run some kind of pre-release of Torque 2.1.2 on FreeBSD 4.10
> > This really worries me. This kind of broken fault tolerance raises
> > questions about whether Torque is acceptable for a mission-critical
> > production environment.
> > Has anyone experienced anything like this? Is it FreeBSD-related? Is it
> > possible to fix?
> Can you check the MOM and server logs and see if the following could
> have happened?
> 1. job script exits on the node (natural exit or killed doesn't matter)
> 2. MOM sends JobObit to server
> 3. node dies
> 4. server tries to reply to the JobObit (can't connect, starts retrying)
> 5. node reboots and sends another JobObit
> I think that scenario would result in on_job_exit() being called a
> second time after the job was already removed and free()'d.
In this particular case, the jobs were in the Exiting state because the MOMs
were trying to deliver huge error files to a faulty NFS mount, which made the
nodes unresponsive even to ssh. So I put the nodes offline and purged the jobs
(maybe I did not really need to, but I wanted to get rid of them). Several
hours later, admins rebooted those boxes for us, which crashed the server.
I can try digging into the log files, but they are huge and I don't have much
time at the moment. Please tell me whether that scenario makes sense or
whether I still need to dig.
And I know we have had several earlier incidents with different versions of
Torque in which the server stopped responding or crashed after we touched
nodes (even idle or offline ones with no jobs). I am not quite sure how far
our snapshot diverges from the 2.1.2 release. If you are saying that fault
tolerance was improved just before the 2.1.2 release, then we can try
upgrading and see if it works better. Do you mean that if a MOM sends the job
exit event twice, the server crashes? If so, was it fixed in 2.1.2?