[torquedev] Trunk And Multithreading

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Sat Dec 11 08:44:11 MST 2010

>>> I have been able to find intermittent failures in communication
>>> which cause temporary problems, but I was able to run 40,000 jobs in
>>> 36 minutes without any crashes.
>> Please don't forget to test the code with valgrind's DRD and Helgrind
>> tools to verify that there aren't any hidden race conditions or other
>> thread-related errors.
> I assure you I am using all of these tools to try to improve things. I don't think I would've ever gotten this far without valgrind. That is some beautiful software. There are definitely still things to be worked out. For example, TORQUE still has memory errors in valgrind (it has for years). I am checking constantly to avoid adding more, but there is work to be done. The helgrind tool reports errors as well, as I haven't protected some of the lesser variables yet. For example, jobs are protected, but time_now isn't yet. There are some more significant things that need protecting - I need to rework the way I'm protecting things in the rpp protocols, as I believe this is what is causing the intermittent failures - but right now TORQUE recovers from all of these minor things and largely works. It isn't ready for release. I'm not trying to present a finished product; I'm asking for some help in testing what we have and refining the solution.
>> Btw. what is the motive for making Torque threaded? Plain Torque can
>> easily process over 5,000 jobs per 10 minutes. That includes
>> submit and scheduling time (pbs_sched); the plain server is much faster.
> I'm afraid I have to argue with your math here. 5,000 jobs in 10 minutes is about 8.3 jobs per second. 40,000 jobs in 36 minutes is about 18.5 jobs per second. My test isn't using a scheduler, but it is a script that qruns all of the jobs. However, I'm sure this is made up for by the fact that pbs_server is servicing a qstat and a pbsnodes -a request every second. The multi-threaded server is about twice as fast, and if you mix in the requests, it is exponentially faster. In a single-threaded TORQUE, nothing happens while pbs_server responds to a request. If you submit a qstat, a qrun, a qsub, and a pbsnodes at once, they get handled serially, often causing timeouts and other problems. Multiple threads decrease latency and increase throughput.
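
For reference, the throughput figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name jobs_per_second is made up for illustration, and the inputs are the two figures from this thread (5,000 jobs in 10 minutes, 40,000 jobs in 36 minutes).

```python
def jobs_per_second(jobs: int, minutes: float) -> float:
    """Average throughput over the whole run, in jobs per second."""
    return jobs / (minutes * 60)


# Figures quoted in this thread.
single = jobs_per_second(5_000, 10)     # single-threaded server
threaded = jobs_per_second(40_000, 36)  # multi-threaded trunk test
speedup = threaded / single

print(f"single-threaded: {single:.1f} jobs/s")
print(f"multi-threaded:  {threaded:.1f} jobs/s")
print(f"ratio:           {speedup:.1f}x")
```

The ratio comes out a little over 2x. Note that the two runs were not made under the same conditions (one includes pbs_sched, the other uses a qrun script plus periodic qstat and pbsnodes -a requests), so the number is only a rough comparison.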

I'm certainly not claiming that 5k jobs per 10 minutes is more than 40k
jobs per 36 minutes. What I'm asking, though, is whether a two- to
three-fold speed increase is worth the trouble.

I'm definitely looking forward to some tests (I might do some myself),
but seriously, unless there is something like a 10-100x speed
increase, the threads are not worth it. I know of several trivial
optimizations that would improve the server's speed a lot without
creating ANY problems in the code. One that I'm seriously considering
adding to our production environment is forking for each status request.
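
To make the forking idea concrete, here is a minimal sketch in Python (not TORQUE code, which is C). It assumes a Unix system, and every name in it (serve_status_in_child, job_table, the job IDs) is hypothetical. The point is only that the child answers from the snapshot of server state it inherited at fork() time, so the parent can go back to its main loop immediately and no locking is needed.

```python
import os
import socket


def serve_status_in_child(conn: socket.socket, job_table: dict) -> int:
    """Fork once per status request; the child replies, the parent moves on."""
    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy-on-write snapshot of job_table.
        try:
            lines = (f"{job_id} {state}" for job_id, state in sorted(job_table.items()))
            conn.sendall("\n".join(lines).encode())
        finally:
            conn.close()
            os._exit(0)  # never fall back into the parent's main loop
    # Parent: drop its copy of the connection and keep serving.
    conn.close()
    return pid


def demo() -> str:
    jobs = {"1.server": "R", "2.server": "Q"}  # made-up job states
    client, server_side = socket.socketpair()
    pid = serve_status_in_child(server_side, jobs)
    jobs["3.server"] = "Q"  # later change in the parent; the child does not see it
    data = b""
    while chunk := client.recv(4096):
        data += chunk
    client.close()
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return data.decode()


if __name__ == "__main__":
    print(demo())
```

A real server would reap children from a SIGCHLD handler (or ignore SIGCHLD) rather than waiting inline, and would probably cap the number of concurrent children.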

Mgr. Šimon Tóth
