[torqueusers] torque not scaling well

Ronny T. Lampert telecaadmin at gmail.com
Thu Aug 2 01:49:50 MDT 2007


> Running max debug shows that a scheduling
> pass goes from well under 1 second below
> 1500 jobs or so up to 10 or 15 minutes
> as the queue length increases, and it's
> pretty much all spent waiting on pbs_server
> to respond to maui.  We're not yet sure
> what that means, but it does really make
> us wonder about torque.

If I may chime in: I observed a similar thing quite a while ago.
Back in the 1.2.X days the server used to provide information on only *1* job per
request.
Each request was handled via select() (or poll()), plus some overhead from
stat()ing et al., in the server and took rather long to complete.

I was running pbs_sched back then, and transferring job information from
pbs_server to pbs_sched took more than 1 second per job (because of this
select()ing on both sides, plus timeouts)!
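
To put rough numbers on why one job per request hurts, here is a minimal
back-of-the-envelope sketch in Python. It is only a cost model, not TORQUE
code: the function name, the batch size of 100 and the flat 1 second per
request are assumptions picked for illustration (the 1 second is taken from
the "more than 1 second per job" figure above).

```python
def pass_time_seconds(num_jobs, per_request_s, jobs_per_request=1):
    """Estimate the time to move job info for one scheduling pass.

    Hypothetical model: every request/response round trip costs a fixed
    per_request_s seconds, regardless of how many jobs it carries.
    """
    # Ceiling division: number of round trips needed to cover all jobs.
    requests = -(-num_jobs // jobs_per_request)
    return requests * per_request_s


# 1500 queued jobs, one job per request, 1 second per round trip (assumed).
one_per_request = pass_time_seconds(1500, 1.0)

# Same queue, but with a made-up batch size of 100 jobs per request.
batched = pass_time_seconds(1500, 1.0, jobs_per_request=100)

print("one job per request:", one_per_request / 60, "minutes")
print("100 jobs per request:", batched, "seconds")
```

In this model the pass time grows linearly with queue length as long as each
job needs its own round trip, so the fixed per-request overhead dominates
everything else.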

strace on Linux can do wonders for finding out what's going on
system-call-wise. Maybe you should capture a trace and send it to someone
who wants to analyze it.
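
If you record the trace with strace's -T option (which appends the time spent
in each system call as <seconds> at the end of the line), a few lines of
Python can total the time per system call. This is only a sketch: the helper
name is made up, and the sample lines below are invented to match the -T
output format, they are not from a real pbs_server trace.

```python
import re
from collections import defaultdict

# Matches a completed syscall line from `strace -T`, with an optional
# leading PID column (present when tracing with -f):
#   name(args...) = result <seconds>
# Split "<unfinished ...>" / "resumed" lines do not match and are skipped.
LINE_RE = re.compile(r'^(?:\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$')


def time_per_syscall(lines):
    """Return {syscall_name: total_seconds} for the given strace lines."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            totals[m.group(1)] += float(m.group(2))
    return dict(totals)


# Invented sample lines in `strace -T` format (illustration only).
sample = [
    'select(4, [3], NULL, NULL, {1, 0}) = 0 (Timeout) <1.001024>',
    'stat("/some/path", {st_mode=S_IFREG|0644, st_size=512, ...}) = 0 <0.000031>',
    'select(4, [3], NULL, NULL, {1, 0}) = 1 (in [3]) <0.250000>',
]

# Print syscalls sorted by total time, largest first.
for name, secs in sorted(time_per_syscall(sample).items(),
                         key=lambda kv: -kv[1]):
    print(f"{name:10s} {secs:.6f}")
```

To use it on a real trace, pass an open file instead of the sample list, for
example time_per_syscall(open("trace.txt")) where trace.txt is whatever file
you gave to strace's -o option. If select() or poll() timeouts dominate the
totals, that would point at the same per-request waiting described above.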

Cheers,
Ronny

