Bugzilla – Bug 208
pbs_mom segfaults in tm_request
Last modified: 2013-02-13 18:34:07 MST
We see a large number of segfaults in pbs_mom. This is the syslog entry:
Jul 22 02:25:35 b177 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Jul 22 02:25:35 b177 kernel: pbs_mom: segfault at 0000000000000008 rip 00002b13f903a5ef rsp 00007fff035e0990 error 4
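For anyone triaging similar entries, the trailing "error N" field in the kernel line is the x86 page-fault error code. A minimal sketch for decoding it follows; the function name decode_segv_error is made up for illustration. Under this reading, "error 4" is a user-mode read of an unmapped page, and a fault address of 0x8 would be consistent with reading a field at offset 8 through a NULL pointer, though only a backtrace can confirm that.

```shell
#!/bin/sh
# Decode the "error N" field of an x86 kernel segfault line.
# decode_segv_error is a hypothetical helper name, not part of any tool.
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
decode_segv_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

# The value reported in the syslog entry above.
decode_segv_error 4
```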
This problem exists in at least TORQUE 2.5.11 and 3.0.5.
As a consequence we are losing jobs on the cluster:
=>> PBS: job killed: node 28 (b177) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpiexec: killing job...
Do you happen to have a back trace of the core for this?
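If it helps, here is a rough sketch of one way to pull a backtrace out of a core file non-interactively with gdb. The helper name core_backtrace and both paths in the usage comment are assumptions; substitute wherever pbs_mom is installed and wherever the node writes its core files.

```shell
#!/bin/sh
# Print a backtrace of all threads from a core file using gdb in batch mode.
# core_backtrace is a hypothetical helper; the paths below are placeholders.
# Usage: core_backtrace BINARY COREFILE
core_backtrace() {
    binary=$1
    core=$2
    if [ ! -f "$binary" ] || [ ! -f "$core" ]; then
        echo "usage: core_backtrace BINARY COREFILE (both must exist)" >&2
        return 1
    fi
    # -batch exits after running the -ex commands instead of opening a prompt.
    gdb -batch -ex "thread apply all bt" "$binary" "$core"
}

# Example (assumed locations, adjust for your installation):
# core_backtrace /usr/local/sbin/pbs_mom /path/to/core
```

A pbs_mom built with debug symbols (-g) gives a much more useful trace.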
Try this patch:
It fixed the problem for us.
To confirm whether you are hitting the bug the patch addresses, try the following.
Submit an interactive job for 12 nodes with 12 cores per node (the more nodes and cores, the better the chance of hitting it):
qsub -I -l nodes=12:ppn=12
Once you are granted a shell, start the following loop:
for i in `seq 1 20`; do pbsdsh /usr/bin/id; done
Press Ctrl+C several times while the loop is running.
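To avoid pressing Ctrl+C by hand, the interruption can be approximated with timeout from GNU coreutils, which sends SIGINT to the command after a delay. This is only a sketch: the function name interrupt_loop and the 0.5 second delay are guesses, and a real Ctrl+C signals the whole foreground process group rather than a single process, so it may not trigger the bug as reliably as the manual test.

```shell
#!/bin/sh
# Run a command repeatedly, interrupting each run with SIGINT after a delay.
# interrupt_loop is a hypothetical helper; it needs GNU coreutils timeout.
# Usage: interrupt_loop COUNT DELAY COMMAND [ARGS...]
interrupt_loop() {
    count=$1
    delay=$2
    shift 2
    i=1
    interrupted=0
    while [ "$i" -le "$count" ]; do
        # timeout exits with status 124 when the delay expired and it
        # had to send the signal; otherwise it returns the command's status.
        timeout -s INT "$delay" "$@"
        if [ $? -eq 124 ]; then
            interrupted=$((interrupted + 1))
        fi
        i=$((i + 1))
    done
    echo "$interrupted of $count runs interrupted"
}

# Inside the interactive job (qsub -I -l nodes=12:ppn=12), for example:
# interrupt_loop 20 0.5 pbsdsh /usr/bin/id
```

If every run finishes before the delay expires, shorten the delay until some runs are reported as interrupted, then check syslog on the nodes for the segfault.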