[torqueusers] stability again

Alexander Saydakov saydakov at yahoo-inc.com
Fri Sep 29 11:38:39 MDT 2006



I would like to report another incident in which rebooting a few nodes
resulted in a server crash. Those nodes became unresponsive because of some
other problem, not related to Torque in any way, were put offline, and were
rebooted. This is not the first time that losing nodes has made the server
unresponsive or led to a core dump.


Core was generated by `pbs_server'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libkvm.so.2...done.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0  0x1038636 in get_next ()
(gdb) bt
#0  0x1038636 in get_next ()
#1  0x1012a7f in remove_job_delete_nanny ()
#2  0x1013e5c in on_job_exit ()
#3  0x1028c24 in dispatch_task ()
#4  0x10042e7 in process_Dreply ()
#5  0x1039f3d in wait_request ()
#6  0x100f9c3 in main ()
#7  0x1001fa6 in _start ()


We are running a pre-release build of Torque 2.1.2 on FreeBSD 4.10.


This really worries me. Broken fault tolerance of this kind raises the
question of whether Torque is acceptable for mission-critical production use.

Has anyone experienced anything like this? Is it FreeBSD-related? Is it hard
to fix?



