[torqueusers] torque pbs_mom segfaults

Lukasz Flis l.flis at cyf-kr.edu.pl
Sat Oct 13 18:32:07 MDT 2012


Dear All,


We have observed a few pbs_mom crashes that are related to mom-to-mom
communication. We haven't managed to reproduce the issue, but it seems
to occur when an application that uses the TM interface on multiple
nodes (OpenMPI) is running and one of its processes segfaults.

The torque version affected by this bug is 2.5.12.

We have filed a support ticket for moab/torque, but I'd like to hear
from you whether you have ever encountered such an error.

Please find a text file with more details in the attachment.


It's worth noting that even if your pbs_mom or server has crashed with
a segfault and didn't dump a core file, it is still possible to locate
the place in the code where the crash happened.

Just use /proc/<pid>/smaps of a running mom to find which library or
program owns the page that RIP/EIP points to.
Calculate the relative RIP/EIP address and use addr2line to find the
line of code where the program crashed.
Binaries with debug symbols are needed for that (the torque-debug
package is sufficient).
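
The lookup described above could be sketched roughly as follows. This
is only an illustration: the function names, the sample mappings and
the sample address are made up, and it assumes the instruction pointer
is already known (for example from the kernel's segfault message). It
computes the address as ip - mapping start + file offset, which is
what addr2line usually expects for shared libraries. For an executable
that is not position independent, the absolute address would normally
be passed unchanged instead.

```python
import re

# Header line of a mapping in /proc/<pid>/smaps (same layout as maps):
#   start-end perms offset dev inode pathname
MAP_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s+([0-9a-f]+)\s+\S+\s+\d+\s*(.*)$"
)


def locate(smaps_text, ip):
    """Return (pathname, relative address) of the mapping holding ip."""
    for line in smaps_text.splitlines():
        m = MAP_LINE.match(line)
        if not m:
            continue  # per-mapping detail lines such as "Size: 4 kB"
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= ip < end:
            offset = int(m.group(4), 16)
            return m.group(5).strip(), ip - start + offset
    return None


def addr2line_command(path, rel):
    """Build the addr2line call; the binary needs debug symbols."""
    return ["addr2line", "-f", "-C", "-e", path, hex(rel)]


# Hypothetical mappings and address, for illustration only.
sample = """\
00400000-0047a000 r-xp 00000000 08:01 131 /usr/sbin/pbs_mom
Size:                488 kB
7f3a1c000000-7f3a1c021000 r-xp 00000000 08:01 262 /usr/lib64/libtorque.so.2.0.0
Size:                132 kB
"""

hit = locate(sample, 0x7F3A1C001234)
if hit is not None:
    print(" ".join(addr2line_command(*hit)))
```

On a real node the text would be read from /proc/<pid>/smaps of the
running mom, and the printed command would be run against the binary
or library from the debug package.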

Regards,
--
Lukasz Flis



-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-mom-crash-c.log
Type: text/x-log
Size: 10262 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121014/e1b6bf1e/attachment-0001.bin 

