[torquedev] Trunk And Multithreading
glen.beane at gmail.com
Fri Dec 10 05:54:00 MST 2010
2010/12/10 "Mgr. Šimon Tóth" <SimonT at mail.muni.cz>:
>> I've been working on making TORQUE multithreaded, as Ken mentioned. If anyone would like to download and test this code, it would be most appreciated. In order to use it, you have to do a few things:
>> 1. Add to your configure options --enable-pthreads. Currently the threading options are all encased in #ifdefs. This will not be the case when the code is released, but for now they are there.
>> 2. Set the parameters min_threads, max_threads, and thread_idle_seconds via qmgr. These are server parameters. min_threads and max_threads are the minimum and maximum number of threads, and they both default to 5. thread_idle_seconds defaults to -1, meaning the threads will stay idle forever without exiting. If this value is set to 300, then the threads will exit if they are idle for 300 seconds. I don't know if these values will be the defaults when the code is released, but for now those are the defaults.
>> So far, qstat, job obits, quejob, pbsnodes, and many other batch requests are all handled by threads. Node statuses received from moms are also handled by threads. I have tested this:
>> - 10 moms are running on one box.
>> - watch -n 1 qstat is running
>> - watch -n 1 pbsnodes -a is running
>> - a script is running which submits 100 jobs that simply execute the ls command, and then runs all 100. This continues indefinitely.
>> I have been able to find intermittent failures in communication which cause temporary problems, but I was able to run 40,000 jobs in 36 minutes without any crashes.
>> One thing you'll likely notice quickly is that qstat may report jobs out of order. This is because the jobs are no longer stored in order, since they are stored in a global, resizable array with synchronized access instead of a global linked list. This had to be done to prevent crashes, as a linked list of size n with 2n distributed links is impossible (or very close to impossible) to make threadsafe.
>> This code involves changes throughout TORQUE; while it is a significant improvement to the code, it needs lots of testing. Any help that anyone in the community can provide will be greatly appreciated. If you do test, please check the code out and update frequently because there are likely to be changes made every day or close to every day.
> It would be great if this were in a separate branch, it's kind of
> hard to follow in trunk.
> I'm personally scared. How do you handle all the concurrency stuff?
> Pretty much every function is working with global variables with no
> protection at all.
> For example, I think that the code now allows two jobs with identical
> ids, since several threads can pass through find_job(jid)
> concurrently, all getting NULL as the result (and then all
> continuing to add the job).
I agree with a branch. These changes make me nervous, and I think
they will take a while to get right. I'm also not excited about the
resizable array and storing jobs out of order - can we come up with a
better way (even if it is more difficult to code)?
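
To make the duplicate-id concern from the quoted message concrete, here is a minimal C sketch of a resizable array with synchronized access, assuming a single mutex guards the whole table. The names job_table, job_table_add_unique, and find_locked are hypothetical and are not TORQUE functions. The only point is that the lookup and the insert happen under the same lock, so two threads cannot both see "not found" and then both add the same job.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical job table: a growable array of job ids guarded by one mutex. */
typedef struct {
    char          **ids;       /* owned copies of the job id strings */
    size_t          count;     /* number of ids currently stored */
    size_t          capacity;  /* number of slots allocated */
    pthread_mutex_t lock;      /* protects every field above */
} job_table;

/* Returns 0 on success, like pthread_mutex_init. */
int job_table_init(job_table *t)
{
    t->ids = NULL;
    t->count = 0;
    t->capacity = 0;
    return pthread_mutex_init(&t->lock, NULL);
}

/* Caller must hold t->lock. Returns the index of id, or -1 if absent. */
static long find_locked(const job_table *t, const char *id)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->ids[i], id) == 0)
            return (long)i;
    return -1;
}

/* Check-and-insert as one critical section.
 * Returns 1 if added, 0 if the id already exists, -1 if allocation fails. */
int job_table_add_unique(job_table *t, const char *id)
{
    int rc = 0;

    assert(t != NULL && id != NULL);
    pthread_mutex_lock(&t->lock);
    if (find_locked(t, id) < 0) {
        rc = -1;
        if (t->count == t->capacity) {
            /* Grow geometrically so repeated adds stay cheap on average. */
            size_t ncap = t->capacity ? t->capacity * 2 : 8;
            char **grown = realloc(t->ids, ncap * sizeof *grown);
            if (grown != NULL) {
                t->ids = grown;
                t->capacity = ncap;
            }
        }
        if (t->count < t->capacity) {
            char *copy = malloc(strlen(id) + 1);
            if (copy != NULL) {
                strcpy(copy, id);
                t->ids[t->count++] = copy;
                rc = 1;
            }
        }
    }
    pthread_mutex_unlock(&t->lock);
    return rc;
}

/* Frees all stored ids; no other thread may be using the table. */
void job_table_destroy(job_table *t)
{
    for (size_t i = 0; i < t->count; i++)
        free(t->ids[i]);
    free(t->ids);
    pthread_mutex_destroy(&t->lock);
}
```

If the lookup and the insert were instead two separate calls, each taking and releasing the lock on its own, the race described in the quoted message would remain even though every individual access is synchronized. A single global lock is the simplest choice for a sketch; it says nothing about keeping jobs in order, which would need a different structure.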