Bugzilla – Bug 208
pbs_mom segfaults in tm_request
Last modified: 2013-02-13 18:34:07 MST
We see a large number of segfaults in pbs_mom. This is the syslog entry:

  Jul 22 02:25:35 b177 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
  Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at 0000000000000008 rip 00002b13f903a5ef rsp 00007fff035e0990 error 4

This problem exists (at least) in torque 2.5.11 and 3.0.5. As a consequence we are losing jobs on the cluster:

  =>> PBS: job killed: node 28 (b177) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
  mpiexec: killing job...

- Martin
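As a reading aid for the kernel line in the report: on x86, the trailing "error N" is the page-fault error code, and its low three bits say whether the page was present, whether the access was a write, and whether it came from user mode. The sketch below decodes those three bits only; the function name decode_segv_error is made up for illustration and is not part of torque or any system tool.

```shell
# Hypothetical helper (not part of torque): decode the low three bits of
# the x86 page-fault error code printed in the kernel "segfault at" line.
decode_segv_error() {
    code=$1
    # bit 0: 0 = page not present, 1 = protection violation
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # bit 1: 0 = read access, 1 = write access
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    # bit 2: 0 = kernel mode, 1 = user mode
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}
```

For the value in the report, decode_segv_error 4 gives "user-mode read, page not present". Together with the fault address 0000000000000008, that is consistent with a read through a NULL pointer at a small offset, although the log line alone does not prove it; a backtrace would be needed to confirm.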
Martin, do you happen to have a backtrace from the core file for this?
Martin,

Try this patch:

  https://github.com/adaptivecomputing/torque/commit/d2df9d4909e7a54b9633738ccedba8459a678e2f

It fixed the problem for us.

-- Lukasz Flis
Martin,

To confirm whether this is the bug addressed by the patch, try the following. Submit an interactive job for 12 nodes with 12 cores per node (the more nodes and cores, the better the chance of hitting it):

  qsub -I -l nodes=12:ppn=12

Once you are granted a shell, start the following loop:

  for i in `seq 1 20`; do pbsdsh /usr/bin/id; done

Hit Ctrl+C several times during the loop.

Cheers
-- LKF
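The loop from the reproduction steps can also be wrapped in a small function so that the iteration count and command are easy to vary between runs. This is only a sketch: the function name run_pbsdsh_loop and the PBSDSH override variable are assumptions introduced here, not torque features, and the defaults simply mirror the values in the comment above.

```shell
# Hypothetical wrapper around the reproduction loop.
# PBSDSH is an assumed override for the launcher; it defaults to pbsdsh on PATH.
run_pbsdsh_loop() {
    count=${1:-20}            # number of iterations (20 in the original loop)
    cmd=${2:-/usr/bin/id}     # command to launch on every allocated slot
    launcher=${PBSDSH:-pbsdsh}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i of $count" >&2
        # Report a failing launcher on stderr and continue with the next iteration.
        "$launcher" "$cmd" || echo "launcher exited with status $? on iteration $i" >&2
        i=$((i + 1))
    done
}
```

Inside the interactive job shell this would be called as run_pbsdsh_loop 20 /usr/bin/id, interrupting with Ctrl+C as described above. Progress messages go to stderr so that the output of the launched command stays separate.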