[torquedev] pbm_mom segfault in TMomCheckJobChild

Joshua Bernstein jbernstein at penguincomputing.com
Mon Dec 22 12:20:37 MST 2008



Michael Barnes wrote:
> On Sat, Dec 20, 2008 at 02:11:13PM -0800, Joshua Bernstein wrote:
>>> how are you reproducing the failure?
>> After running, say, 500 to 1000 very short jobs through a stock queue
>> configuration, eventually one, then all of the pbs_moms either throw
>> an out-of-file-descriptors error and sit eating 100% of a CPU in an
>> epoll() loop, or sometimes simply segfault if given more fds.
>> It's really pretty easy to replicate. This of course is seen with the
>> 2.3.3 release, though I've been able to reproduce it at another site
>> with 2.3.5 and 2.1.9, and even 2.4.0.
> 
> I am not launching jobs through the tm interface; if you are, that is
> the only difference I can think of between the environments, besides
> the OS. I can say that with 2.1.9, I do not see this behavior at all.
> 
> I have 3 clusters: 400 Opterons and 320 Intel, both with the pbs_mom
> compiled in 64-bit mode, and 256 32-bit nodes, all connected to the same
> pbs_server. I also have ~200 32-bit Intel boxes connected to another
> pbs_server, and I have not seen a pbs_mom hiccup or failure since I
> installed this version of torque (again 2.1.9), I guess almost a year
> ago now.
> 
> After hearing about these file descriptor leaks, I ran lsof on one of my
> pbs_moms, and it only showed 5 or so. Shared libraries, cwd, all of the
> expected open files.
> 
> I'm not saying that there are no bugs in torque, but I've never
> experienced nor heard of these kinds of problems before, and I would
> think that if this were a blatant bug in torque, then other sites would
> have reported it by now.

I agree, Michael. Up to this point, I've only experienced a few minor
bugs in TORQUE, but certainly nothing this drastic. I believe that this
particular issue revolves around the workload: many thousands of short,
small jobs. I imagine on a much larger cluster, such as yours, the jobs
run for much longer.
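
One way to check whether a given pbs_mom is leaking descriptors under
this kind of workload is to sample its open fd count over time. What
follows is only a minimal sketch, assuming Linux with /proc mounted;
the helper names and the threshold are hypothetical and not part of
TORQUE:

```python
import os

def count_open_fds(pid):
    """Return the number of open file descriptors for pid (Linux /proc only)."""
    return len(os.listdir("/proc/%d/fd" % pid))

def looks_like_leak(samples, threshold=50):
    """Heuristic (assumed threshold): True if the fd count grew by more
    than `threshold` between the first and the last sample."""
    return len(samples) >= 2 and samples[-1] - samples[0] > threshold
```

Calling count_open_fds() with the pbs_mom pid every minute or so while
the short jobs run, and passing the collected counts to
looks_like_leak(), should show whether the count stays flat or climbs
steadily. Reading another user's /proc/<pid>/fd generally requires
root.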

That all said, I'm surprised that others haven't reported similar
issues. Though just because nobody has reported it doesn't mean it's
not happening. It may just mean they've switched to something like SGE
or PBSPro. At least that's what I've heard from others. ;-)

-Joshua Bernstein
Software Engineer
Penguin Computing

