[torquedev] Trunk And Multithreading

David Beer dbeer at adaptivecomputing.com
Fri Dec 10 09:58:07 MST 2010

----- Original Message -----
> > I have been able to find intermittent failures in communication
> > which cause temporary problems, but I was able to run 40,000 jobs in
> > 36 minutes without any crashes.
> Please don't forget to test the code in valgrind's DRD and Helgrind
> modules to verify that there aren't any hidden race conditions or
> other thread-related errors.

I assure you I am using all of these tools to try to improve things. I don't think I would have ever gotten this far without valgrind. That is some beautiful software. There are definitely still things to be worked out. For example, TORQUE still reports memory errors under valgrind (it has for years). I am checking constantly to avoid adding more, but there is work to be done. The helgrind tool reports errors as well, since I haven't protected some of the lesser variables yet. For example, jobs are protected, but time_now isn't yet. There are some more significant things that need protecting: I need to rework the way I'm protecting things in the rpp protocols, as I believe this is what is causing the intermittent failures. Right now, though, TORQUE recovers from all of these minor things and largely works. It isn't ready for release. I'm not trying to present a finished product; I'm asking for some help in testing what we have and refining the solution.
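
To make "protecting" a shared variable such as time_now concrete, here is a minimal sketch of the general idea. It is written in Python rather than TORQUE's C, and the SharedClock class, its fields, and run_updaters are hypothetical names chosen for illustration; none of this is taken from the TORQUE source.

```python
import threading
import time


class SharedClock:
    """Hypothetical stand-in for a shared value like time_now."""

    def __init__(self):
        self._lock = threading.Lock()
        self._time_now = 0.0
        self._updates = 0

    def update(self):
        # Writers take the lock so both fields change together.
        with self._lock:
            self._time_now = time.time()
            self._updates += 1

    def read(self):
        # Readers take the same lock, so they never see a half-updated state.
        with self._lock:
            return self._time_now, self._updates


def run_updaters(clock, n_threads=4, n_updates=1000):
    """Start n_threads workers that each call clock.update() n_updates times."""
    def worker():
        for _ in range(n_updates):
            clock.update()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clock.read()
```

Every access goes through the same lock, which is the kind of discipline a tool like helgrind checks for; an unprotected variable is one that some thread touches without holding it.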

> Btw. what are the motives for making Torque threaded? Plain Torque can
> easily process over 5,000 jobs per 10 minutes. That includes
> submit and scheduling time (pbs_sched), plain server is much faster.

I'm afraid I have to argue with your math here. 5,000 jobs in 10 minutes is about 8.3 jobs per second. 40,000 jobs in 36 minutes is about 18.5 jobs per second. My test isn't using a scheduler, but it is a script that qruns all of the jobs. However, I'm sure this is made up for by the fact that pbs_server is servicing a qstat and a pbsnodes -a request every second. The multi-threaded server is about twice as fast, and if you mix in the requests, it is far faster. In a single-threaded TORQUE, nothing else happens while pbs_server responds to a request. If you submit a qstat, a qrun, a qsub, and a pbsnodes at once, they get handled serially, often causing timeouts and other problems. Multiple threads decrease latency and increase throughput.
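
For anyone who wants to check the throughput figures, the arithmetic can be reproduced in a few lines. This is just a sketch of the calculation; the function and variable names are made up for the example, and the inputs are the two job counts and durations quoted in this thread.

```python
def jobs_per_second(jobs, minutes):
    """Average throughput in jobs per second over a run of the given length."""
    return jobs / (minutes * 60)


single_threaded = jobs_per_second(5_000, 10)   # figure quoted for plain Torque
multi_threaded = jobs_per_second(40_000, 36)   # figure from the trunk test run
speedup = multi_threaded / single_threaded

print(f"single-threaded: {single_threaded:.1f} jobs/s")  # prints 8.3
print(f"multi-threaded:  {multi_threaded:.1f} jobs/s")   # prints 18.5
print(f"ratio:           {speedup:.1f}x")                # prints 2.2x
```

Note that the two runs were not set up identically (one includes a scheduler, the other includes periodic qstat and pbsnodes -a requests), so the ratio is only a rough comparison.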

Furthermore, it is possible to crash pbs_server by submitting many qsubs at the same time. We had a client that submitted jobs from a Makefile, submitting 20 or more jobs simultaneously. Once each qsub got a connection to the server, it returned, allowing the next one to get a connection to the server. With enough simultaneous submissions, this crashes the server. Not only are we increasing speed, we are making things more stable.

David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606
