[torquedev] pbs_mom segfault in TMomCheckJobChild

Josh Butikofer josh at clusterresources.com
Mon Dec 22 13:10:05 MST 2008


I figure I might as well put in my two bits. :)

First of all, I believe Josh Bernstein when he says that he sees a crash 
and then, after making changes to the code, the crash goes away. I can't 
think of any reason he'd lie to us! :)

With that said, however, I must agree that it is unclear why the changes 
Josh is making would prevent a segfault. Like Glen said, there must be 
something else going on, but this fix evidently changes behavior 
slightly or otherwise helps reduce or eliminate the crash. Maybe it 
offsets a race condition or something? That seems doubtful. I'm just not 
sure (I haven't had the time to look into it properly). That's why I think 
we should keep trying to pin down exactly what is going on and fix 
it in a way that lets us confirm we are addressing an observable 
behavior leading to the crash. I contacted Josh out of band and got some 
of the core files from him to try to get more information about the crash.

> I agree Michael. Up to this point, I've only experienced a few minor 
> bugs in TORQUE, but certainly nothing this drastic. I believe that this 
> particular issue revolves around the workload, many thousands of short, 
> small jobs. I imagine on a much larger cluster, such as yours, the jobs 
> run for much longer.
> 
> That all said, I'm surprised that others haven't reported similar 
> issues. Though, just because nobody has reported it, that doesn't mean it's 
> not happening. It may just mean they've switched to something like SGE 
> or PBSPro. At least that's what I've heard from others. ;-)

As to whether others have seen a problem like this, I can say that a 
similar issue *has* been experienced by customers who, like Josh, try to 
run thousands of very short jobs (1-5 seconds) as fast as possible. Most 
of the problems we saw, however, were actually in the pbs_server and not 
the pbs_mom. Basically, the pbs_server would run out of socket 
descriptors, and errors would show up in the log file about being unable 
to open other kinds of files. Also, sometimes the socket descriptors 
themselves would not leak, but the socket handlers inside of TORQUE would. 
These problems have so far been addressed by two "workarounds":

1) Making TORQUE so that it can support more than 1024 descriptors at a 
time.

2) Increasing the size of the socket handler table.

Both of these fixes are only currently present in the 2.3-extreme and 
2.3-yahoo branches.

The problem seemed to occur in our labs in pbs_server when you have a 
LOT of jobs (thousands) all running at the same time (or close to the 
same time). I bet if you had a lot of jobs running on one pbs_mom you 
could have a similar problem happening in the other direction.

Anyway, if Josh has a reproducible case, which it sounds like he does, 
this is a perfect time to fix the problem. We could never easily 
reproduce the issue.

One question remains for me: do we wait to release 2.3.6 for 
this fix, considering that it may take until after the holidays to fully 
understand and fix? Or do we try to roll out 2.3.6 before Christmas 
hits? Most sane admins wouldn't upgrade before the holidays anyway ... 
but who knows. :)

--Josh B.

