[torqueusers] pbs_mom dies on exit of interactive session

Roy Dragseth roy.dragseth at cc.uit.no
Mon Apr 30 16:07:45 MDT 2012


Here is a job excerpt demonstrating the error messages when running pbsdsh.
The pbs_mom thread dies if you try to run pbsdsh -u.

dmesg shows a segfault.


[marve at hpc1 ~]$ qsub -I -lnodes=2:ppn=2,walltime=1000
qsub: waiting for job 13.hpc1.cc.uit.no to start
qsub: job 13.hpc1.cc.uit.no ready

[marve at compute-0-2 ~]$ pbsdsh uname -a
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 
GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: Event poll failed, error TM_ENOTCONNECTED
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 
GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 
GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 
GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: reconnected
pbsdsh: Event poll failed, error TM_ENOTFOUND
[marve at compute-0-2 ~]$ 
[marve at compute-0-2 ~]$ pbsdsh -u uname -a
[marve at compute-0-2 ~]$ pbsdsh -u uname -a
[marve at compute-0-2 ~]$ dmesg | tail -n1
pbs_mom[1980]: segfault at 20 ip 000000000040b240 sp 00007fffc1853820 error 4 
in pbs_mom[400000+5f000]
[marve at compute-0-2 ~]$ logout

qsub: job 13.hpc1.cc.uit.no completed
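
For anyone reading the dmesg line above, here is a rough sketch of how its
fields can be pulled apart. It assumes the usual x86-64 kernel segfault
format (hex values throughout) and the standard page-fault error-code bits
(bit 0 = page present, bit 1 = write, bit 2 = user mode). The function and
key names are made up for illustration and are not part of TORQUE.

```python
import re

# Kernel segfault line copied from the dmesg output above,
# rejoined onto one line (the mail client wrapped it).
LINE = ("pbs_mom[1980]: segfault at 20 ip 000000000040b240 "
        "sp 00007fffc1853820 error 4 in pbs_mom[400000+5f000]")

# Assumed layout: comm[pid]: segfault at ADDR ip IP sp SP error ERR in OBJ[BASE+SIZE]
PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its numeric fields (hypothetical helper)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),  # address the process tried to access
        "ip": ip,                          # instruction that faulted
        "offset": ip - base,               # position of that instruction inside the mapping
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    info = decode_segfault(LINE)
    print(hex(info["fault_addr"]), hex(info["offset"]), info["write"])
```

Read this way, the line describes a user-mode read of address 0x20 on an
unmapped page, which is what a field access through a NULL pointer typically
looks like. If pbs_mom was built with debug symbols, something like
"addr2line -e <path to pbs_mom> 0x40b240" may point at the source line; that
assumes a non-relocated executable, as the 400000 base suggests.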

This is using torque-4.0.1-snap.201204031702.tar.gz

(The problems with getting 4.0.X up and running seem to be related to 
the fact that pbs_server now listens on only one interface; earlier it 
listened on all interfaces.  I'll post a separate report for this.)

r.