[torquedev] pbs_mom segfault in TMomCheckJobChild

Michael Barnes Michael.Barnes at jlab.org
Mon Dec 22 07:06:02 MST 2008


On Sat, Dec 20, 2008 at 02:11:13PM -0800, Joshua Bernstein wrote:
> >how are you reproducing the failure?
> 
> After running, say, 500 to 1000 very short jobs through a stock queue
> configuration, eventually one, then all, of the pbs_moms either throw
> an out-of-file-descriptors error and sit eating 100% of a CPU in an
> epoll() loop, or sometimes simply segfault if given more fds. It's
> really pretty easy to replicate. This of course is seen with the
> 2.3.3 release, though I've been able to reproduce it at another site
> with 2.3.5 and 2.1.9, and even 2.4.0.
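
[Editor's note: the symptom described above, a pbs_mom pinning a CPU in an
epoll() loop once it runs out of descriptors, is consistent with a stale
descriptor staying registered in the event set. The following minimal C
sketch is hypothetical and is not torque's actual event loop; it only
illustrates how a leaked fd can turn epoll_wait() into a busy loop.]

/*
 * Illustrative sketch only -- NOT torque's event loop.  Once the peer of a
 * registered descriptor goes away, epoll_wait() reports EPOLLHUP on every
 * call, and a handler that never closes or deregisters the fd just spins.
 */
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create(16);
    int pipefd[2];
    struct epoll_event ev, events[16];

    pipe(pipefd);
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    close(pipefd[1]);   /* writer gone: read end now reports EPOLLHUP */

    for (;;) {
        int n = epoll_wait(epfd, events, 16, -1);

        /*
         * Bug pattern: EPOLLHUP comes back instantly on every iteration,
         * but the fd is never closed or removed with EPOLL_CTL_DEL, so
         * the loop never blocks and the process eats 100% of a CPU.
         */
        if (n > 0 && (events[0].events & EPOLLHUP))
            continue;   /* missing: close()/EPOLL_CTL_DEL on the dead fd */
    }
}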

Unless you are launching the jobs through the tm interface (which I am
not), that and the OS are the only differences I can think of between
the environments. I can say that with 2.1.9, I do not see this behavior
at all.

I have 3 clusters, 400 Opterons and 320 Intel nodes, both with pbs_mom
compiled in 64-bit mode, plus 256 32-bit nodes, all connected to the same
pbs_server. I also have ~200 32-bit Intel boxes connected to another
pbs_server, and I have not seen a pbs_mom hiccup or failure since I
installed this version of torque (again, 2.1.9), which I guess was almost
a year ago now.

After hearing about these file descriptor leaks, I ran lsof on one of my
pbs_moms, and it showed only five or so open files: shared libraries, the
cwd, all of the expected entries.
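
[Editor's note: a quick way to watch for such a leak without lsof is to
count the entries under /proc/<pid>/fd. The small C sketch below is
illustrative only; the pid handling and output format are not anything
from torque, and a leaking pbs_mom would show this count climbing as
jobs run.]

/*
 * Rough equivalent of the lsof check above: count /proc/<pid>/fd entries.
 * (The count includes the directory stream this program itself opens.)
 */
#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "self";
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/fd", pid);

    DIR *d = opendir(path);
    if (d == NULL) {
        perror(path);
        return 1;
    }

    int n = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
        if (de->d_name[0] != '.')   /* skip "." and ".." */
            n++;

    closedir(d);
    printf("pid %s: %d open descriptors\n", pid, n);
    return 0;
}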

I'm not saying that there are no bugs in torque, but I've never
experienced nor heard of these kinds of problems before, and I would
think that if this were a blatant bug in torque, other sites would have
reported it by now.

Regards,

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

