[torquedev] Trunk And Multithreading
knielson at adaptivecomputing.com
Fri Dec 10 09:31:07 MST 2010
On 12/10/2010 04:55 AM, "Mgr. Šimon Tóth" wrote:
>> I've been working on making TORQUE multithreaded, as Ken mentioned. If anyone would like to download and test this code, it would be most appreciated. In order to use it, you have to do a few things:
>> 1. Add to your configure options --enable-pthreads. Currently the threading options are all encased in #ifdefs. This will not be the case when the code is released, but for now they are there.
>> 2. Set the parameters min_threads, max_threads, and thread_idle_seconds via qmgr. These are server parameters. min_threads and max_threads are the minimum and maximum number of threads, and they both default to 5. thread_idle_seconds defaults to -1, meaning the threads will stay idle forever without exiting. If this value is set to 300, then the threads will exit if they are idle for 300 seconds. I don't know if these values will be the defaults when the code is released, but for now those are the defaults.
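The thread_idle_seconds behaviour described above can be sketched with a timed condition wait. This is a minimal illustration, not TORQUE's actual thread pool: `work_queue` and `wait_for_work` are hypothetical names, and only the meaning of the parameter (-1 waits forever, N exits after N idle seconds) comes from the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical request queue; only the counter matters for this sketch. */
typedef struct work_queue {
    pthread_mutex_t lock;      /* guards pending */
    pthread_cond_t  has_work;  /* signalled when pending becomes non-zero */
    int             pending;   /* number of requests waiting for a thread */
} work_queue;

/* Blocks until a request is pending or the idle limit passes.
 * thread_idle_seconds < 0 means wait forever (the -1 default).
 * Returns 1 after claiming a request, 0 if the thread should exit as idle. */
int wait_for_work(work_queue *q, long thread_idle_seconds)
{
    struct timespec deadline;
    int rc = 0;
    int claimed = 0;

    assert(q != NULL);

    /* Condition variables use CLOCK_REALTIME unless configured otherwise. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    if (thread_idle_seconds >= 0)
        deadline.tv_sec += thread_idle_seconds;

    pthread_mutex_lock(&q->lock);
    while (q->pending == 0 && rc != ETIMEDOUT) {
        if (thread_idle_seconds < 0)
            rc = pthread_cond_wait(&q->has_work, &q->lock);
        else
            rc = pthread_cond_timedwait(&q->has_work, &q->lock, &deadline);
    }
    if (q->pending > 0) {
        q->pending--;
        claimed = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return claimed;
}
```

A worker loop would call wait_for_work() repeatedly and return from its thread function the first time it gets 0. Whether a real pool should also refuse to drop below min_threads is left out here.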
>> So far, qstat, job obits, quejob, pbsnodes, and many other batch requests are all handled by threads. Node statuses received from moms are also handled by threads. I have tested this:
>> - 10 moms are running on one box.
>> - watch -n 1 qstat is running
>> - watch -n 1 pbsnodes -a is running
>> - a script is running which submits 100 jobs that simply execute the ls command, and then runs all 100. This continues indefinitely.
>> I have been able to find intermittent failures in communication which cause temporary problems, but I was able to run 40,000 jobs in 36 minutes without any crashes.
>> One thing you'll likely notice quickly is that qstat may report jobs out of order. This is because the jobs are no longer stored in order, since they are stored in a global, resizable array with synchronized access instead of a global linked list. This had to be done to prevent crashes, as a linked list of size n with 2n distributed links is impossible (or very close to impossible) to make threadsafe.
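To make the ordering point concrete, here is a rough sketch of a global, resizable array with synchronized access. All names (`job_array`, `job_array_insert`, `job_array_remove`) are hypothetical and the slot-reuse policy is an assumption; the real TORQUE structure may differ. It shows how a freed slot can be reused by a later job, so walking the array no longer matches submission order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical container; not TORQUE's actual data structure. */
typedef struct job_array {
    pthread_mutex_t lock;   /* guards every field below */
    void          **slots;  /* job pointers; NULL marks a free slot */
    size_t          capacity;
    size_t          count;
} job_array;

/* Returns 0 on success, -1 on failure. */
int job_array_init(job_array *a, size_t initial_capacity)
{
    assert(a != NULL && initial_capacity > 0);
    a->slots = calloc(initial_capacity, sizeof *a->slots);
    if (a->slots == NULL)
        return -1;
    a->capacity = initial_capacity;
    a->count = 0;
    if (pthread_mutex_init(&a->lock, NULL) != 0) {
        free(a->slots);
        return -1;
    }
    return 0;
}

/* Stores job (must be non-NULL) in the first free slot, doubling the
 * array when it is full. Returns the slot index, or -1 on failure. */
long job_array_insert(job_array *a, void *job)
{
    long index = -1;

    pthread_mutex_lock(&a->lock);
    if (a->count == a->capacity) {
        size_t new_cap = a->capacity * 2;
        void **grown = realloc(a->slots, new_cap * sizeof *grown);
        if (grown == NULL) {
            pthread_mutex_unlock(&a->lock);
            return -1;
        }
        for (size_t i = a->capacity; i < new_cap; i++)
            grown[i] = NULL;
        a->slots = grown;
        a->capacity = new_cap;
    }
    for (size_t i = 0; i < a->capacity; i++) {
        if (a->slots[i] == NULL) {
            a->slots[i] = job;
            a->count++;
            index = (long)i;
            break;
        }
    }
    pthread_mutex_unlock(&a->lock);
    return index;
}

/* Frees a slot and returns the job that was in it, or NULL if empty. */
void *job_array_remove(job_array *a, size_t index)
{
    void *job = NULL;

    pthread_mutex_lock(&a->lock);
    if (index < a->capacity && a->slots[index] != NULL) {
        job = a->slots[index];
        a->slots[index] = NULL;
        a->count--;
    }
    pthread_mutex_unlock(&a->lock);
    return job;
}
```

With a capacity of 2, inserting three jobs grows the array and fills slots 0, 1, and 2. If the first job then finishes and a fourth is submitted, the fourth lands in slot 0, ahead of the older jobs.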
>> This code involves changes throughout TORQUE; while it is a significant improvement to the code, it needs lots of testing. Any help that anyone in the community can provide will be greatly appreciated. If you do test, please check the code out and update frequently because there are likely to be changes made every day or close to every day.
> It would be great if this were in a separate branch; it's kind of
> hard to follow in trunk.
> I'm personally scared. How do you handle all the concurrency stuff?
> Pretty much every function is working with global variables with no
> protection at all.
> For example, I think that the code now allows two jobs with identical
> ids, since several threads can pass through find_job(jid)
> concurrently, all getting NULL as the result (and then all
> continuing to add the job).
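The race described here is a check-then-act problem: the lookup and the insert must happen under one lock, or two threads can both see "not found". A minimal sketch of the usual fix follows. The table, `find_job_locked`, `add_job_if_absent`, and `race_same_id` are all hypothetical names for illustration and are not TORQUE's find_job() or its job storage.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_JOBS   64
#define JOB_ID_LEN 64
#define MAX_RACERS 16

/* Hypothetical job table; one lock covers both lookup and insert. */
static pthread_mutex_t jobs_lock = PTHREAD_MUTEX_INITIALIZER;
static char job_ids[MAX_JOBS][JOB_ID_LEN];
static int  job_count = 0;

/* Caller must hold jobs_lock. Returns the index of jid, or -1. */
static int find_job_locked(const char *jid)
{
    for (int i = 0; i < job_count; i++)
        if (strcmp(job_ids[i], jid) == 0)
            return i;
    return -1;
}

/* Looks up and inserts in one critical section.
 * Returns 0 if added; -1 if the id exists, is too long, or the table is full. */
int add_job_if_absent(const char *jid)
{
    int rc = -1;

    assert(jid != NULL);
    pthread_mutex_lock(&jobs_lock);
    if (strlen(jid) < JOB_ID_LEN && job_count < MAX_JOBS &&
        find_job_locked(jid) < 0) {
        strcpy(job_ids[job_count++], jid);
        rc = 0;
    }
    pthread_mutex_unlock(&jobs_lock);
    return rc;
}

/* Thread body: a non-NULL return value means this thread added the job. */
static void *submit_worker(void *arg)
{
    return add_job_if_absent((const char *)arg) == 0 ? arg : NULL;
}

/* Starts up to MAX_RACERS threads that all submit the same id and
 * returns how many of them succeeded. */
int race_same_id(const char *jid, int nthreads)
{
    pthread_t threads[MAX_RACERS];
    int started = 0;
    int successes = 0;

    if (nthreads > MAX_RACERS)
        nthreads = MAX_RACERS;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&threads[started], NULL, submit_worker,
                           (void *)jid) == 0)
            started++;
    for (int i = 0; i < started; i++) {
        void *result = NULL;
        pthread_join(threads[i], &result);
        if (result != NULL)
            successes++;
    }
    return successes;
}
```

If the lock were released between find_job_locked() and the strcpy(), several racers could succeed. With the single critical section, at most one can.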
Trunk is the proper place for all of this. Trunk by convention is the
branch for latest development. It is the unstable branch. It is more or
less the alpha to beta code.
We have been studying the single-threaded design of TORQUE, with its
unprotected global variables, for over a year. We have made some test versions as
proof of concept and we have found that it is not the intractable
problem we thought it would be. David has been able to get the code in
trunk into a stable condition and it is ready for anyone who is curious
enough to try it.
Even though the server is now multi-threaded, each individual job is
submitted on its own thread. We cannot create two jobs with the same id.
(Never say never.) It would be a good test case to see what happens when
two users request to run the same job at the same time. The result
should be that one succeeds and the other fails with an error that the
job is already running.
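The expected outcome of that test case can be sketched as a state transition guarded by a per-job lock. This is only an illustration of the behaviour described above, with hypothetical names (`job`, `request_run`, `RUN_ALREADY_RUNNING`); it is not how TORQUE's run-job request is implemented.

```c
#include <assert.h>
#include <pthread.h>

enum job_state  { JOB_QUEUED, JOB_RUNNING };
enum run_result { RUN_STARTED, RUN_ALREADY_RUNNING };

/* Hypothetical job record: the lock guards the state field. */
typedef struct job {
    pthread_mutex_t lock;
    enum job_state  state;
} job;

/* Returns 0 on success, like pthread_mutex_init(). */
int job_init(job *j)
{
    assert(j != NULL);
    j->state = JOB_QUEUED;
    return pthread_mutex_init(&j->lock, NULL);
}

/* Checks and changes the state in one critical section, so that of two
 * simultaneous requests exactly one sees JOB_QUEUED and starts the job. */
enum run_result request_run(job *j)
{
    enum run_result result = RUN_ALREADY_RUNNING;

    assert(j != NULL);
    pthread_mutex_lock(&j->lock);
    if (j->state == JOB_QUEUED) {
        j->state = JOB_RUNNING;
        result = RUN_STARTED;
    }
    pthread_mutex_unlock(&j->lock);
    return result;
}
```

Whichever request takes the lock first gets RUN_STARTED; the other gets RUN_ALREADY_RUNNING and can report that error back to the user.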