Bugzilla – Bug 95
Support for GPUs
Last modified: 2010-11-18 10:04:50 MST
It seems that all that is needed for exclusive GPU access is changing the ownership of the graphics card device (for nvidia: /dev/nvidiaX). http://stackoverflow.com/questions/4077790/limiting-access-to-resources-for-cuda-and-opencl If this is true, then we can support the homogeneous use case of GPUs very simply (different cards in one machine require much more server and node logic). Counted resources are supported by Bug 67, which ensures correct assignment of jobs requesting GPUs. As for the node part, I see two possible approaches: 1) modifying the linux mom_mach.c file (presumably in mom_set_limits) to correctly find and chown the corresponding GPUs. I'm not sure about the cleanup part (maybe kill_task). 2) doing the GPU assignment/cleanup in the prologue/epilogue. The node code only sets the environment variables GPU_COUNT and GPU_LIST (or similar). The second one might be preferred because it would allow users of Torque to easily write/modify their own implementations of the GPU assignment, which would also make it easy to port to a different GPU API.
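As a rough illustration of approach 2, the prologue/epilogue pair could look something like the sketch below. It assumes GPU_LIST is a comma-separated list of device indices (the comment above only says "GPU_LIST (or similar)", so the format is a guess), that the devices are named /dev/nvidiaX, and that the script runs as root. The function names are hypothetical and not part of TORQUE.

```python
import os


def gpu_device_paths(gpu_list, prefix="/dev/nvidia"):
    """Map an index list such as "0,2" to device paths (assumed GPU_LIST format)."""
    indices = [int(tok) for tok in gpu_list.split(",") if tok.strip()]
    return [f"{prefix}{i}" for i in indices]


def assign_gpus(gpu_list, uid, gid, chown=os.chown, chmod=os.chmod):
    """Prologue side: hand the listed devices to the job owner only."""
    paths = gpu_device_paths(gpu_list)
    for path in paths:
        chown(path, uid, gid)
        chmod(path, 0o600)  # owner read/write, nobody else
    return paths


def release_gpus(gpu_list, chown=os.chown, chmod=os.chmod):
    """Epilogue side: give the devices back to root and lock them again."""
    paths = gpu_device_paths(gpu_list)
    for path in paths:
        chown(path, 0, 0)
        chmod(path, 0o600)
    return paths
```

The chown/chmod callables are parameters only so the logic can be exercised without real devices; a site would presumably adapt the permission policy to its own needs.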
This is something that we have been thinking about at Adaptive as well. Our thoughts for the first cycle are a little different from what is suggested. In the first pass, we are planning to report GPUs just like pbs_server does with processor assignment. We are going to allow the user to specify gpus=X in the nodes file, and then TORQUE will track which GPU is assigned to which job and report it. I'm thinking that $PBS_NODEFILE will add a line for each gpu: hostname gpu<index> And from there, each job should handle things (like grabbing the correct gpu), just like TORQUE does by default with cpus. Eventually, we have a number of features we would like to add, but first we want to release the feature as I have described it, and once it is in use and people need more, we will add those features. I am convinced that this feature will be easy to add to TORQUE (Ken and I are actually working on it right now and we will be done soon) and will significantly improve GPU usage for our users. From there, some of the features we want to add include the autodetection of gpus and exclusive access to the assigned gpus.
(In reply to comment #1) Just to clarify, when you are talking about cpus, do you mean cpus (from PBSPro) or ppn? Anyway, to be absolutely blunt, I don't like this approach very much. It doesn't look very generic.
> Just to clarify, when you are talking about cpus, you mean cpus (from PBSPro) > or ppn? > I'm talking about ppn/np. GPUs would be specified in the same way that you specify the ppn allowed on that node, that is, with np=X in the nodes file. > Anyway, to be absolutely blunt. I don't like this approach very much. Doesn't > look very generic. It's okay if you don't like it, but please be more specific. This approach seems about as generic as possible to me, so without more details there isn't much for me to do with that.
(In reply to comment #3) > > Anyway, to be absolutely blunt. I don't like this approach very much. Doesn't > > look very generic. > > Its okay if you don't like it, but please be more specific. This approach seems > about as generic as possible to me, so without more details there isn't much > for me to do with that. Sorry, what I meant is that this sounds like adding GPU-specific code. What will be done if, for example, disk space is added as the next supported resource? More specific code for disk space?
>Counted resources are supported by Bug 67, that ensures correct assignment of >jobs requesting GPUs. By elevating the GPU to the same level as ppn, the GPU is now a counted resource. Moreover, we can now create a node spec that specifies how many processors and GPUs are needed for a job. For example: qsub -l nodes=hostA:ppn=2:gpu=1 <job.sh> This will allocate two np and one gpu on hostA. We can do multiple node assignments as well. qsub -l nodes=2:ppn=2:gpu=1+2:ppn=2:gpu=2,mem=4Gb <job.sh> We have now requested two nodes with two np and one gpu each, plus two more nodes with two np and two gpus each. The configuration and syntax fit easily in the current TORQUE build. It is also generic as to what a gpu is. Later we can add the syntax to qsub to support exclusive access and other features of gpus. We could also add an auto-detect feature that would populate each host with the number of gpus available plus report statistics in pbsnodes for the gpus. Another advantage of this syntax is that it can fit easily into the existing tm interface. MPI would need few changes, if any at all, to manage gpus on multiple MOMs.
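To make the request syntax above concrete, here is a minimal sketch of how such a -l string could be broken down. This is not TORQUE's actual parser; parse_nodes_spec and the dictionary layout are hypothetical, and the defaults (one node, ppn=1, gpu=0 when a field is absent) are assumptions.

```python
def parse_nodes_spec(resource_list):
    """Parse e.g. 'nodes=2:ppn=2:gpu=1+2:ppn=2:gpu=2,mem=4Gb' into chunks."""
    nodes_value = None
    for item in resource_list.split(","):
        name, _, value = item.partition("=")
        if name == "nodes":
            nodes_value = value
    if nodes_value is None:
        return []

    chunks = []
    for chunk in nodes_value.split("+"):
        fields = chunk.split(":")
        req = {"count": 1, "host": None, "ppn": 1, "gpu": 0, "properties": []}
        head = fields[0]
        if head.isdigit():
            req["count"] = int(head)   # "2:ppn=2" means two nodes
        else:
            req["host"] = head         # "hostA:ppn=2" names one host
        for field in fields[1:]:
            key, sep, value = field.partition("=")
            if sep and key in ("ppn", "gpu"):
                req[key] = int(value)
            else:
                req["properties"].append(field)
        chunks.append(req)
    return chunks


def total_gpus(chunks):
    """Total GPUs requested across all chunks."""
    return sum(c["count"] * c["gpu"] for c in chunks)
```

For the second example in the comment, this yields two chunks (2 nodes with 1 gpu each, 2 nodes with 2 gpus each), six GPUs in total.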
(In reply to comment #5) Uhm. OK, I'm really sorry, but I really don't understand this post. Yes, that is how my patch works. I know that. :-) I wrote it myself, so it would be kind of weird if I didn't. :-)
(In reply to comment #4) > (In reply to comment #3) > > Sorry, what I meant is that this sounds like adding a specific code for > handling GPUs. What will be done if for example disk space will be added as > next supported resource? More specific code for disk space? There is a small amount of code that needs to be added to support the gpu. It makes sense to put this in the nodes file with np because these are both processing type units. They will be requested similarly as well. We could go down the path of assigning all resources in the nodes file but the other resources like disk space, memory and such are not treated the same as np's within TORQUE.
> > Sorry, what I meant is that this sounds like adding a specific code for > > handling GPUs. What will be done if for example disk space will be added as > > next supported resource? More specific code for disk space? > > There is a small amount of code that needs to be added to support the gpu. It > makes sense to put this in the nodes file with np because these are both > processing type units. They will be requested similarly as well. Well, any resource should be put into the nodes file. Specifying the resources on the server instead of on the node makes sense. I personally don't really like resources being specified in the node configuration file (although I guess some admins might prefer this). > We could go down the path of assigning all resources in the nodes file but the > other resources like disk space, memory and such are not treated the same as > np's within TORQUE. Well yes. NPs are treated as slots. That makes sense for CPUs, but not really for anything else. GPUs can still be treated as slots, but there won't be any slot mapping (because each slot for NP has its own sub-node).
Hi All, Someone pointed me to this ticket and I have some more info that I will dump as it may help (or at least prevent some headaches). Using the current CUDA 3.1 drivers, locking the permissions to one driver will lock all devices. So you would only be able to lock all or none of the GPU resources to a user. This has been changed in CUDA 3.2, as NVIDIA also thinks it is a good idea to be able to lock individual GPUs to users. Below is a copy of some investigation work I did into the new feature; more info will be available at http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html. -=snip=- GPU to process locking (Vizworld "allow admins to lock processes to certain GPU’s") Good news - it may now be possible to lock GPUs to user jobs via some chmod queue scripting. I don't know where Vizworld got the statement, as it is not mentioned anywhere in the release notes... however... There is talk of 3.2 better handling user access restrictions via permissions to /dev/nvidiactl "- a user can now access a subset of GPUs by having RW privileges to /dev/nvidiactl and RW privileges to only a subset of the /dev/nvidia[0...n] rather than having the CUDA driver throw an error if you can't access any of the nodes; devices that a user doesn't have permissions to will not be visible to the app (think CUDA_VISIBLE_DEVICES version 2.0)" http://forums.nvidia.com/lofiversion/index.php?t180547.html Looking at CUDA_VISIBLE_DEVICES in 3.1, we see that the environment variable can restrict what devices a user sees http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/cudatoolkit_release_notes_linux.txt -=snip=- I currently have a dual Tesla C1060 node set up with CUDA 3.2 RC - if you want me to test anything let me know. Cheers, Paul paulmc@vpac.org
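The permission rule quoted above (RW on /dev/nvidiactl plus RW on a subset of /dev/nvidia[0...n]) can be checked from user space. The sketch below only mirrors that rule with os.access; it does not ask the CUDA driver anything, and the function name and the max_devices probe limit are assumptions made for illustration.

```python
import os


def accessible_gpus(max_devices=16, access=os.access):
    """Return indices N for which the caller has RW access to /dev/nvidiaN.

    Follows the rule quoted from the forum post: without RW access to
    /dev/nvidiactl, no device is considered usable at all.
    """
    mode = os.R_OK | os.W_OK
    if not access("/dev/nvidiactl", mode):
        return []
    return [i for i in range(max_devices) if access(f"/dev/nvidia{i}", mode)]
```

A job script or epilogue check could call this to confirm that the permissions set up for a job match the GPUs it was actually given.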
Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE? I suspect you will find that a lot of people/programs process it on the assumption that it's just one line per CPU per host and just has the hostname in it. Things like Fluent, etc.
(In reply to comment #10) > Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE, I > suspect you will find a lot of people/programs process it on the assumption > that it's just one line per CPU per host and just has the hostname in it. > > Things like Fluent, etc.. I'm not sure what is planned by the core developers, but my patch only extends the syntax, while staying on one line: node.hostname property1 property2 resources_total.mem=4G resources_total.gpu=2 resources_total.gpu_type=tesla I guess the official patch will work in a similar fashion.
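For reference, a one-line entry in the style shown above can be split into name, properties and totals with very little code. This is only a sketch of the syntax as written in this comment, not the patch's parser; parse_node_line and the returned dictionary layout are hypothetical.

```python
PREFIX = "resources_total."


def parse_node_line(line):
    """Split 'name prop1 prop2 resources_total.x=y ...' into its parts."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty node line")
    node = {
        "name": tokens[0],
        "properties": [],
        "attributes": {},        # other key=value tokens such as np=4
        "resources_total": {},
    }
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            node["properties"].append(token)
        elif key.startswith(PREFIX):
            node["resources_total"][key[len(PREFIX):]] = value
        else:
            node["attributes"][key] = value
    return node
```

Values are kept as strings on purpose, since the comment does not say how sizes such as 4G would be interpreted.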
(In reply to comment #10) > Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE, I > suspect you will find a lot of people/programs process it on the assumption > that it's just one line per CPU per host and just has the hostname in it. > > Things like Fluent, etc.. Thanks for the heads up Chris. I was thinking that since most scripts/programs process $PBS_NODEFILE to decide where to run, that this was the logical place to put information about which GPUs to run on. It seems to me that any program that parses $PBS_NODEFILE would need to be updated in order to place itself on a GPU. Please also note that if GPUs are not requested, no change would be made to $PBS_NODEFILE. That being said, if desired, we could create something else to store the GPUs in - perhaps the $PBS_GPUFILE or something. Would that be better? David
> That being said, if desired, we could create something else to store the GPUs > in - perhaps the $PBS_GPUFILE or something. Would that be better? > After considering things, we decided to answer our own question and go with storing things in the $PBS_GPUFILE. Any objections? David
(In reply to comment #13) > > That being said, if desired, we could create something else to store the GPUs > > in - perhaps the $PBS_GPUFILE or something. Would that be better? > > > > After considering things, we decided to answer our own question and go with > storing things in the $PBS_GPUFILE. Any objections? Oh well, I would, but my patch still wasn't accepted. Therefore I will have to maintain my own resource semantics anyway.
Just in case there has been any confusion, the $PBS_NODEFILE I was referring to was the one generated by the mother superior pbs_mom for each job and traditionally just consisted of one hostname per line per vnode allocated to the job. Not the file that defines nodes in the pbs_server.
(In reply to comment #15) > Just in case there has been any confusion, the $PBS_NODEFILE I was referring to > was the one generated by the mother superior pbs_mom for each job and > traditionally just consisted of one hostname per line per vnode allocated to > the job. Not the file that defines nodes in the pbs_server. Oh, yes, I did misunderstand.
(In reply to comment #15) > Just in case there has been any confusion, the $PBS_NODEFILE I was referring to > was the one generated by the mother superior pbs_mom for each job and > traditionally just consisted of one hostname per line per vnode allocated to > the job. Not the file that defines nodes in the pbs_server. I understood. We switched things to $PBS_GPUFILE, and added the environment variable $PBS_GPUFILE to the job's environment. The environment variable is always there, but the file is only created when GPUs are actually requested. David
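Since the variable is always set but the file may not exist, a job that reads it has to tolerate a missing file. A minimal sketch, assuming only what is stated above (the helper name read_gpufile is hypothetical):

```python
import os


def read_gpufile(environ=os.environ):
    """Return the non-empty lines of $PBS_GPUFILE, or [] if no GPUs were requested.

    The variable is expected to be set for every job, but the file itself
    is only created when the job asked for GPUs.
    """
    path = environ.get("PBS_GPUFILE")
    if not path or not os.path.isfile(path):
        return []
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

An empty result then simply means "run without a GPU", which keeps jobs that never request GPUs unaffected.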
Could the mom use cuda_wrapper (http://sourceforge.net/projects/cudawrapper/) to set up the environment for the user?
(In reply to comment #17) > (In reply to comment #15) > > Just in case there has been any confusion, the $PBS_NODEFILE I was referring to > > was the one generated by the mother superior pbs_mom for each job and > > traditionally just consisted of one hostname per line per vnode allocated to > > the job. Not the file that defines nodes in the pbs_server. > > I understood. We switched things to $PBS_GPUFILE, and added the environment > variable $PBS_GPUFILE to the job's environment. The environment variable is > always there, but the file is only created when GPUs are actually requested. Looking at the code in 2.5-fixes, how will the program actually know which cards are allocated? Will the ids match the devices?
(In reply to comment #18) > Could the mom use cuda_wrapper (http://sourceforge.net/projects/cudawrapper/) > to set up the environment for the user? This is something we are looking into for improvements to the feature. Most likely we would try to use OpenCL, because we would like to have support for AMD and nvidia cards, even though nvidia dominates the market.
> > Looking at the code in 2.5-fixes, how will the program actually know which > cards are allocated? Will the ids match the devices? Our first pass idea is to do what TORQUE does with cores (ppn, virtual processors, however they should be referred to). An admin is allowed to overload their cores if desired - they can set a 4 core machine to ppn=8 or anything they like. There is also no guarantee that, if they are assigned host/0 (theoretically the 0th core) that the job will actually execute on the 0th core. We can imagine that some site is going to want to overload their gpus, just as some sites do with cpus, and so our initial approach is to handle gpus exactly the same way cores are handled by default. It is up to the user to guarantee that they actually execute on the GPU(s) assigned to their job, by reading the file $PBS_GPUFILE. Eventually, we will add options to lock GPUs to their jobs (like cpusets) and to autodetect the number and types of GPUs on each system. This is something we will eventually do but not something TORQUE can handle at this point.
(In reply to comment #21) This is actually a different issue. With GPU APIs, you need to specify a card upon initialization, so the job needs to know which gpus are allocated. What I'm asking about is the mapping. If there isn't one, how will the user know which cards are OK to use?
(In reply to comment #22) Each job has a $PBS_GPUFILE for that job (like the $PBS_NODEFILE for each job). The job script should parse this file for gpus. It contains lines in the format: <hostname>-gpu<index>. This file is meant to have this specification for each job.
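A job could turn lines of the form <hostname>-gpu<index> into the list of indices meant for the local host roughly as follows. This is a sketch only: the function names are hypothetical, matching on the short hostname is an assumption, and, as discussed earlier in this thread, nothing guarantees that a TORQUE index corresponds to a particular physical device.

```python
import re
import socket

_GPU_LINE = re.compile(r"^(?P<host>.+)-gpu(?P<index>\d+)$")


def local_gpu_indices(lines, hostname=None):
    """Pick the GPU indices assigned on this host from $PBS_GPUFILE lines."""
    if hostname is None:
        hostname = socket.gethostname()
    short = hostname.split(".")[0]
    indices = []
    for line in lines:
        match = _GPU_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that do not follow the documented format
        host = match.group("host")
        if host == hostname or host.split(".")[0] == short:
            indices.append(int(match.group("index")))
    return sorted(indices)


def visible_devices_value(indices):
    """Format indices for an environment variable such as CUDA_VISIBLE_DEVICES."""
    return ",".join(str(i) for i in indices)
```

A CUDA job might export the second function's result as CUDA_VISIBLE_DEVICES before starting the application; whether that mapping is correct on a given node is up to the site.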