[torqueusers] Torque 2.1.x pbs_server process hogging cpu

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Thu Jun 15 04:51:31 MDT 2006


On Thursday 15 June 2006 10:38, garrick at speculation.org wrote:

> IMHO, bindresvport() doesn't like the link-local line; which sounds to
> me like it could be a linux kernel bug.  Of course, this isn't really
> my area of expertise and I could be completely wrong.

I doubt that this is the cause, because other services on Linux use 
bindresvport(), too (I read somewhere that tcpwrappers, or something in that 
area, does), and they are working well. But sometimes it's the 
unlikely...

What bothers me, though, is the fact that client_to_svr() succeeds when 
connecting to the scheduler (moab in my case) or when starting the job on a 
node, but that it fails only at the second attempt to connect to the mom. 
BTW, I also noticed moms looping infinitely after jobs had finished, 
apparently when they were trying to notify the server.

> The original bindresvport() code in TORQUE otherwise seems fine.  This

I think so, too.

> is the first I've heard of it not working, so I'm willing to think this
> is a weird corner case with your configuration.

I have to protest! Everything is configured right [tm] on my cluster!

> If you have support with Novell, you may want to put together a small
> test case (socket(), bindresvport(), connect()) and take it up with
> them.

I guess I would have to set up a listener (e.g. with netcat) and write this 
small test case so that it loops. Maybe next week or so...
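
A rough sketch of what such a looping test could look like, assuming glibc on 
Linux, IPv4 only, and a listener already running on the target. The helper 
names (make_addr, try_once, run_trials) are made up for illustration and are 
not taken from the TORQUE sources:

```c
#define _DEFAULT_SOURCE       /* exposes bindresvport() in glibc headers */
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: fill in an IPv4 address.
 * Returns 0 on success, -1 if host is not a dotted-quad address. */
static int make_addr(const char *host, int port, struct sockaddr_in *sa)
{
    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;
    sa->sin_port = htons((unsigned short)port);
    return inet_pton(AF_INET, host, &sa->sin_addr) == 1 ? 0 : -1;
}

/* One socket(), bindresvport(), connect() round.
 * Returns 0 on success, otherwise the errno of the failing call,
 * with *step set to the name of that call. */
static int try_once(const struct sockaddr_in *peer, const char **step)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        *step = "socket";
        return errno;
    }

    int err = 0;
    /* NULL lets glibc pick the local address; needs root privileges. */
    if (bindresvport(fd, NULL) < 0) {
        err = errno;
        *step = "bindresvport";
    } else if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
        err = errno;
        *step = "connect";
    }
    close(fd);
    return err;
}

/* Repeat the round and report every failure on stderr.
 * Returns the number of failed rounds, or -1 for a bad address. */
static int run_trials(const char *host, int port, int rounds)
{
    struct sockaddr_in peer;

    assert(rounds >= 0);
    if (make_addr(host, port, &peer) != 0)
        return -1;

    int failures = 0;
    for (int i = 0; i < rounds; i++) {
        const char *step = "";
        int err = try_once(&peer, &step);
        if (err != 0) {
            failures++;
            fprintf(stderr, "round %d: %s failed: %s\n",
                    i, step, strerror(err));
        }
    }
    return failures;
}
```

To try it, one would call something like run_trials("127.0.0.1", 15002, 1000) 
from a main() and run the program as root, since binding a reserved port 
needs privileges. The address, port and round count here are only 
placeholders. Reporting which of the three calls failed, and with which 
errno, should show whether bindresvport() itself is at fault or the 
connect() that follows it.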

> Or we just live with the new code, be happy it works, and not worry
> about it :)

For the moment, yes.

Bye,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

