[torqueusers] stability again

Garrick Staples garrick at clusterresources.com
Fri Sep 29 13:43:06 MDT 2006


On Fri, Sep 29, 2006 at 10:38:39AM -0700, Alexander Saydakov alleged:
> Hi!
> 
> I would like to report another incident in which rebooting a few nodes
> resulted in a server crash. Those nodes became unresponsive because of
> some other problem, not related to Torque in any way, and were put
> offline and rebooted. This is not the first time that losing nodes has
> made the server unresponsive or led to a core dump.
> 
> Core was generated by `pbs_server'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libkvm.so.2...done.
> Reading symbols from /usr/lib/libc.so.4...done.
> Reading symbols from /usr/libexec/ld-elf.so.1...done.
> #0  0x1038636 in get_next ()
> (gdb) bt
> #0  0x1038636 in get_next ()
> #1  0x1012a7f in remove_job_delete_nanny ()
> #2  0x1013e5c in on_job_exit ()
> #3  0x1028c24 in dispatch_task ()
> #4  0x10042e7 in process_Dreply ()
> #5  0x1039f3d in wait_request ()
> #6  0x100f9c3 in main ()
> #7  0x1001fa6 in _start ()
> 
> We run some kind of pre-release of Torque 2.1.2 on FreeBSD 4.10.
> 
> This really worries me. Broken fault tolerance like this raises the
> question of whether Torque is acceptable for a mission-critical
> production environment.
> 
> Has anyone experienced anything like this? Is it FreeBSD related? Is it
> hard to fix?

I'm pretty sure I got this stuff fixed up in the 2.1.2 release.  I have
had nodes rebooted during jobs dozens of times with 2.1.2 without an
issue.  Can you do a diff between your source tree and the released
2.1.2 tarball?
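A minimal sketch, assuming the stock tarball unpacks to torque-2.1.2/
and your build tree lives in ~/torque-src (adjust both paths for your
setup):

    tar xzf torque-2.1.2.tar.gz
    diff -ru torque-2.1.2 ~/torque-src

An empty diff means you really are on 2.1.2; anything else shows
exactly which pre-release code you are running.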

Are you using keep_completed?  Was the job forcibly purged?
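If you're not sure about keep_completed, qmgr on the server host will
show it (this assumes you have manager access):

    qmgr -c 'list server' | grep keep_completed

An unset or zero value means completed jobs are purged from the server
immediately, which is the sort of window where a stale job reference
could still be sitting on the work-task list.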


