[torqueusers] stability again

Garrick Staples garrick at clusterresources.com
Fri Sep 29 14:22:05 MDT 2006


On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> Hi!
> 
>  
> 
> I would like to report another incident in which rebooting a few nodes
> resulted in a server crash. Those nodes became unresponsive because of
> some other problem, not related to Torque in any way, were put offline,
> and rebooted. This is not the first time that losing nodes has made the
> server unresponsive or led to a core dump.
> 
>  
> 
> Core was generated by `pbs_server'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libkvm.so.2...done.
> Reading symbols from /usr/lib/libc.so.4...done.
> Reading symbols from /usr/libexec/ld-elf.so.1...done.
> #0  0x1038636 in get_next ()
> (gdb) bt
> #0  0x1038636 in get_next ()
> #1  0x1012a7f in remove_job_delete_nanny ()
> #2  0x1013e5c in on_job_exit ()
> #3  0x1028c24 in dispatch_task ()
> #4  0x10042e7 in process_Dreply ()
> #5  0x1039f3d in wait_request ()
> #6  0x100f9c3 in main ()
> #7  0x1001fa6 in _start ()
> 
>  
> 
> We run a pre-release build of Torque 2.1.2 on FreeBSD 4.10.
> 
>  
> 
> This really worries me. Broken fault tolerance like this raises the
> question of whether Torque is acceptable for a mission-critical
> production environment.
> 
> Has anyone experienced anything like this? Is it FreeBSD-related? Is it
> hard to fix?

Can you check the MOM and server logs and see if the following could
have happened?
  1. job script exits on the node (natural exit or killed doesn't matter)
  2. MOM sends JobObit to server
  3. node dies
  4. server tries to reply to the JobObit (can't connect, starts retrying)
  5. node reboots and sends another JobObit

I think that scenario would result in on_job_exit() being called a
second time after the job was already removed and free()'d.
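
Something like this minimal sketch (illustrative only, not TORQUE
source; the struct and function names are made up) shows the pattern I
mean: the second obit dispatches the handler against a job the first
pass already purged, so walking the work-task list reads freed memory,
which is consistent with the fault landing in get_next():

#include <stdlib.h>

struct work_task {
    struct work_task *next;
};

struct job {
    struct work_task *tasks;  /* list a get_next()-style helper walks */
};

/* stand-in for the server purging a finished job */
static void purge_job(struct job *pjob)
{
    struct work_task *t = pjob->tasks;

    while (t != NULL) {
        struct work_task *n = t->next;
        free(t);
        t = n;
    }
    free(pjob);
}

/* stand-in for on_job_exit(): runs once per JobObit the server handles */
static void handle_obit(struct job *pjob)
{
    /* remove_job_delete_nanny() analogue: walk pjob->tasks, then purge */
    purge_job(pjob);
}

int main(void)
{
    struct job *pjob = calloc(1, sizeof(*pjob));

    pjob->tasks = calloc(1, sizeof(*pjob->tasks));

    handle_obit(pjob);  /* first JobObit: job and tasks are freed */
    handle_obit(pjob);  /* second JobObit after the node reboots:
                           dereferences freed memory, typically SIGSEGV */
    return 0;
}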


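If the logs bear that out, the fix is probably to make the duplicate
dispatch harmless. A rough sketch of the direction, not a patch:
re-resolve the job by id at the top of the handler instead of trusting
a pointer cached in the work task (find_job() is the server's existing
lookup by job id; the surrounding wiring here is assumed):

/* bail out if the job was already purged by an earlier obit */
job *pjob = find_job(jobid);

if (pjob == NULL)
    return;  /* nothing left to do for this JobObit */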

