Bug 208 - pbs_mom segfaults in tm_request
Status: NEW
Product: TORQUE
Component: pbs_mom
Version: 2.5.x
Platform: PC Linux
Importance: P5 critical
Assigned To: Ken Nielson
Reported: 2012-07-24 13:10 MDT by Martin Siegert
Modified: 2013-02-13 18:34 MST (History)

Description Martin Siegert 2012-07-24 13:10:41 MDT
We see a large number of segfaults in pbs_mom. This is the syslog entry:

Jul 22 02:25:35 b177 pbs_mom: LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
Jul 22 02:25:35 b177 kernel: pbs_mom[12931]: segfault at 0000000000000008 rip 00002b13f903a5ef rsp 00007fff035e0990 error 4
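The "error 4" field in the kernel line can be decoded, assuming the standard x86 page-fault error-code bits (bit 0: protection violation, bit 1: write, bit 2: user mode). The helper below is a minimal sketch; the function name decode_segv_error is hypothetical and not part of TORQUE or any system tool.

```shell
#!/bin/sh
# Decode the "error N" field of a Linux x86 kernel segfault message.
# Assumed bit layout (x86 page-fault error code):
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
decode_segv_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_segv_error 4   # prints "user-mode read, page not present"
```

Read this way, the entry describes a user-mode read of an unmapped page at address 0x8, which would be consistent with (but does not prove) a read through a NULL pointer at a small offset.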

This problem exists (at least) in torque 2.5.11 and 3.0.5.

As a consequence we are losing jobs on the cluster:

=>> PBS: job killed: node 28 (b177) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpiexec: killing job...

- Martin

Comment 1 Ken Nielson 2012-08-03 11:25:36 MDT
Martin,

Do you happen to have a backtrace from the core file for this?

Comment 2 Lukasz Flis 2013-02-13 17:19:18 MST
Martin,

Try this patch:
https://github.com/adaptivecomputing/torque/commit/d2df9d4909e7a54b9633738ccedba8459a678e2f

It fixed the problem for us. 

--
Lukasz Flis

Comment 3 Lukasz Flis 2013-02-13 18:34:07 MST
Martin,
To confirm whether this is the bug that the patch addresses, try the following.

Submit an interactive job for 12 nodes with 12 cores per node (the more, the better the chance of hitting this):

qsub -I -l nodes=12:ppn=12

Once you are granted a shell, start the following loop:

for i in `seq 1 20`; do pbsdsh /usr/bin/id; done;

Press Ctrl+C several times while the loop is running.
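If pressing Ctrl+C by hand is inconvenient, the interruption can be approximated with a script. The sketch below is an assumption-laden stand-in, not the exact procedure above: it relies on GNU coreutils `timeout` to send SIGINT to each run after a fixed delay, and the function name interrupt_loop and the delay value are hypothetical. A signal sent by `timeout` is not identical to a terminal Ctrl+C, so it may or may not trigger the same failure.

```shell
#!/bin/sh
# Run a command repeatedly, sending it SIGINT if it is still running
# after DELAY seconds. Assumes GNU coreutils `timeout` is installed.
# usage: interrupt_loop RUNS DELAY COMMAND [ARGS...]
interrupt_loop() {
    runs=$1
    delay=$2
    shift 2
    interrupted=0
    i=1
    while [ "$i" -le "$runs" ]; do
        rc=0
        # timeout exits with status 124 when it had to signal the command.
        timeout -s INT "$delay" "$@" || rc=$?
        if [ "$rc" -eq 124 ]; then
            interrupted=$((interrupted + 1))
        fi
        i=$((i + 1))
    done
    echo "$interrupted of $runs runs interrupted"
}

# Example, inside the interactive job (the delay of 1 second is a guess;
# adjust it so the signal arrives while pbsdsh is still running):
#   interrupt_loop 20 1 pbsdsh /usr/bin/id
```

Afterwards, check syslog on the job's nodes for new pbs_mom segfault entries.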

Cheers
--
LKF