Bug 95 - Support for GPUs
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 3.0.0-alpha
Hardware: PC Linux
Importance: P5 enhancement
Assigned To: David Beer
Reported: 2010-11-04 07:08 MDT by Simon Toth
Modified: 2010-11-18 10:04 MST

Description Simon Toth 2010-11-04 07:08:26 MDT
It seems that all that is needed for exclusive GPU access is changing the
ownership of the graphics card device (for nvidia: /dev/nvidiaX).

http://stackoverflow.com/questions/4077790/limiting-access-to-resources-for-cuda-and-opencl

If this is true, then we can support the homogeneous GPU use case very simply
(different cards in one machine would require much more server and node
logic).

Counted resources are covered by Bug 67, which ensures correct assignment of
jobs requesting GPUs.

As for the node part, I see two possible approaches:

1) modifying the Linux mom_mach.c file (presumably in mom_set_limits) to
correctly find and chown the corresponding GPUs. I'm not sure about the cleanup
part (maybe kill_task).

2) doing the GPU assignment/cleanup in the prologue/epilogue. The node code
would only set environment variables such as GPU_COUNT and GPU_LIST (a rough
sketch follows below).

The second approach might be preferred because it would allow TORQUE users to
easily write or modify their own implementations of the GPU assignment, making
it easy to port to a different GPU API.
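A minimal, untested sketch of approach 2), assuming the MOM exports the
assigned device indices to the prologue environment as GPU_LIST and that
exclusive access really only needs chown/chmod on /dev/nvidiaX (everything
here is illustrative, not an existing implementation):

#!/bin/sh
# prologue sketch -- not a tested implementation
# $2 is the job owner, per the usual TORQUE prologue argument convention
jobuser="$2"

# GPU_LIST is assumed to hold the assigned device indices, e.g. "0 2"
for idx in $GPU_LIST; do
    dev="/dev/nvidia$idx"
    if [ -c "$dev" ]; then
        chown "$jobuser" "$dev"
        chmod 600 "$dev"
    fi
done

The matching epilogue would simply chown the devices back to root.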
Comment 1 David Beer 2010-11-04 09:27:38 MDT
This is something that we have been thinking about at Adaptive as well. Our
thoughts for the first cycle through are a little bit different from what is
suggested.

In the first pass, we are planning to report GPUs just like pbs_server does
with processor assignment. We are going to allow the user to specify gpus=X in
the nodes file, and then TORQUE will track which GPU is assigned to which job
and report it. I'm thinking that $PBS_NODEFILE will add a line for each gpu:

hostname gpu<index>

And from there, each job should handle things (like grabbing the correct gpu),
just like TORQUE does by default with cpus. There are a number of features we
would eventually like to add, but first we want to release the feature as I
have described it; once it is in use and people need more, we will add those
features. I am convinced that this feature will be easy to add to TORQUE (Ken
and I are actually working on it right now and will be done soon) and will
significantly improve GPU usage for our users.
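Purely as an illustration of that format (nothing here is final or
implemented), a job script could pull out its assigned gpu indices for the
local host roughly like this:

# hypothetical sketch; assumes $PBS_NODEFILE lines of the form "hostname gpu<index>"
host=$(hostname)
gpus=$(awk -v h="$host" '$1 == h && $2 ~ /^gpu/ { sub(/^gpu/, "", $2); print $2 }' "$PBS_NODEFILE")
echo "GPUs assigned on $host: $gpus"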

From there, some of the features we want to add include the autodetection of
gpus and exclusive access to the assigned gpus.
Comment 2 Simon Toth 2010-11-04 09:43:16 MDT
(In reply to comment #1)
> This is something that we have been thinking about at Adaptive as well. Our
> thoughts for the first cycle through are a little bit differently than is
> suggested.
> 
> In the first pass, we are planning to report GPUs just like pbs_server does
> with processor assignment. We are going to allow the user to specify gpus=X in
> the nodes file, and then TORQUE will track which GPU is assigned to which job
> and report it. I'm thinking that $PBS_NODEFILE will add a line for each gpu:
> 
> hostname gpu<index>
> 
> And from there, each job should handle things (like grabbing the correct gpu),
> just like TORQUE does by default with cpus. Eventually, we have a number of
> features we would like to add, but first we want to release the feature as I
> have described it, and once this is used and people have need for more, we will
> add those features. I am convinced that this feature will be easy to add to
> TORQUE (Ken and I are actually working on it right now and we will be done
> soon) and will significantly improve GPU usage for our users. 
> 
> From there, some of the features we want to add include the autodetection of
> gpus and exclusive access to the assigned gpus.

Just to clarify, when you are talking about cpus, you mean cpus (from PBSPro)
or ppn?

Anyway, to be absolutely blunt: I don't like this approach very much. It
doesn't look very generic.
Comment 3 David Beer 2010-11-04 10:21:46 MDT
> Just to clarify, when you are talking about cpus, you mean cpus (from PBSPro)
> or ppn?
> 

I'm talking about ppn/np. GPUs would be specified in the same way you specify
the ppn allowed on a node, i.e. with np=X in the nodes file.

> Anyway, to be absolutely blunt. I don't like this approach very much. Doesn't
> look very generic.

It's okay if you don't like it, but please be more specific. This approach
seems about as generic as possible to me, so without more details there isn't
much for me to do with that.
Comment 4 Simon Toth 2010-11-04 10:30:28 MDT
(In reply to comment #3)
> > Anyway, to be absolutely blunt. I don't like this approach very much. Doesn't
> > look very generic.
> 
> Its okay if you don't like it, but please be more specific. This approach seems
> about as generic as possible to me, so without more details there isn't much
> for me to do with that.

Sorry, what I meant is that this sounds like adding GPU-specific code. What
will be done if, for example, disk space is added as the next supported
resource? More specific code for disk space?
Comment 5 Ken Nielson 2010-11-04 10:40:09 MDT
>Counted resources are supported by Bug 67, that ensures correct assignment of
>jobs requesting GPUs.

By elevating the GPU to the same level as ppn, the GPU is now a counted
resource. Moreover, we can now create a node spec that can specify how many
processors and GPUs are needed for a job. For example:

qsub -l nodes=hostA:ppn=2:gpu=1 <job.sh>

This will allocate two np and one gpu on hostA. We can do multiple node
assignments as well.

qsub -l nodes=2:ppn=2:gpu=1+2:ppn=2:gpu=2,mem=4Gb <job.sh>

We have now requested two nodes with two np and one gpu each, plus two more
nodes with two np and two gpus each.

The configuration and syntax fit easily in the current TORQUE build. It is also
generic as to what a gpu is. 

Later we can add the syntax to qsub to support exclusive access and other
features of gpus. We could also add an auto-detect feature that would populate
each host with the number of gpus available plus report statistics in pbsnodes
for the gpus.

Another advantage of this syntax is that it fits easily into the existing tm
interface. MPI would not need to make many changes, if any at all, to manage
gpus on multiple MOMs.
Comment 6 Simon Toth 2010-11-04 10:43:59 MDT
(In reply to comment #5)
> >Counted resources are supported by Bug 67, that ensures correct assignment of
> >jobs requesting GPUs.
> 
> By elevating the GPU to the same level as ppn the GPU is now a counted
> resource. Moreover, we can now create a node spec that can specifiy how many
> processors and GPUs are needed for a job. For example:
> 
> qsub -l nodes=hostA:ppn=2:gpu=1 <job.sh>
> 
> This will allocate two np and one gpu on hostA. We can do multiple node
> assignments as well.
> 
> qsub -l nodes=2:ppn=2:gpu=1+2:ppn=2:gpu=2,mem=4Gb <job.sh>
> 
> We have now requested two nodes with two np each and 1 gpu each plus 2 more
> nodes with two np and two gpu each.
> 
> The configuration and syntax fit easily in the current TORQUE build. It is also
> generic as to what a gpu is. 
> 
> Later we can add the syntax to qsub to support exclusive access and other
> features of gpus. We could also add an auto-detect feature that would populate
> each host with the number of gpus available plus report statistics in pbsnodes
> for the gpus.
> 
> Another advantage of this syntax is that it can fit easily into the existing tm
> interface. MPI would not need to make many changes if any at all to manage gpus
> on multiple MOMs.

Uhm. OK, I'm really sorry, but I really don't understand this post. Yes, that
is how my patch works. I know that. :-) I wrote it myself, so it would be kind
of weird if I didn't. :-)
Comment 7 Ken Nielson 2010-11-04 10:46:32 MDT
(In reply to comment #4)
> (In reply to comment #3)
> 
> Sorry, what I meant is that this sounds like adding a specific code for
> handling GPUs. What will be done if for example disk space will be added as
> next supported resource? More specific code for disk space?

There is a small amount of code that needs to be added to support the gpu. It
makes sense to put this in the nodes file with np because these are both
processing type units. They will be requested similarly as well.

We could go down the path of assigning all resources in the nodes file but the
other resources like disk space, memory and such are not treated the same as
np's within TORQUE.
Comment 8 Simon Toth 2010-11-04 10:56:00 MDT
> > Sorry, what I meant is that this sounds like adding a specific code for
> > handling GPUs. What will be done if for example disk space will be added as
> > next supported resource? More specific code for disk space?
> 
> There is a small amount of code that needs to be added to support the gpu. It
> makes sense to put this in the nodes file with np because these are both
> processing type units. They will be requested similarly as well.

Well, any resource should be allowed in the nodes file. Specifying the
resources on the server instead of on the node makes sense. I personally don't
really like resources being specified in the node configuration file (although
I guess some admins might prefer this).

> We could go down the path of assigning all resources in the nodes file but the
> other resources like disk space, memory and such are not treated the same as
> np's within TORQUE.

Well yes. NPs are treated as slots. That makes sense for CPUs, but not really
for anything else. GPUs can still be treated as slots, but there won't be any
slot mapping (because each np slot has its own sub-node).
Comment 9 Paul McIntosh 2010-11-04 21:11:12 MDT
Hi All,

Someone pointed me to this ticket and I have some more info that I will dump as
it may help (or at least prevent some headaches).

Using the current Cuda 3.1 drivers, locking down the permissions on one device
locks out all devices, so you would only be able to lock all GPU resources to
a user, or none.

This has been changed in Cuda 3.2, as NVIDIA also think it is a good idea to
be able to lock individual GPUs to users. Below is a copy of some
investigation work I did into the new feature; more info will be available at
http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html.

-=snip=-
GPU to process locking (Vizworld "allow admins to lock processes to certain
GPU’s")

Good news - it may now be possible to lock GPUs to user jobs via some chmod
queue scripting

I don't know where Vizworld got the statement as it is not mentioned anywhere
in the release notes... however...

There is talk of 3.2 better handling user access restrictions via permissions
to /dev/nvidiactl

"- a user can now access a subset of GPUs by having RW privileges to
/dev/nvidiactl and RW privileges to only a subset of the /dev/nvidia[0...n]
rather than having the CUDA driver throw an error if you can't access any of
the nodes; devices that a user doesn't have permissions to will not be visible
to the app (think CUDA_VISIBLE_DEVICES version 2.0)"
http://forums.nvidia.com/lofiversion/index.php?t180547.html

Looking at CUDA_VISIBLE_DEVICES in 3.1, we see that the environment variable
can restrict what devices a user sees:
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/cudatoolkit_release_notes_linux.txt
-=snip=-
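To make the two mechanisms concrete, here is a rough, untested sketch (the
device indices and user name are made up):

# CUDA 3.1 and later: hide devices from the application via the environment
export CUDA_VISIBLE_DEVICES=0,1

# CUDA 3.2 RC: give the user RW on /dev/nvidiactl (typically already world
# readable/writable) plus RW on only the assigned devices
chown someuser /dev/nvidia0 /dev/nvidia1
chmod 600 /dev/nvidia0 /dev/nvidia1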

I currently have a dual Tesla C1060 node set-up with Cuda 3.2 RC - if you want
me to test anything let me know. 

Cheers,

Paul
paulmc@vpac.org
Comment 10 Chris Samuel 2010-11-07 22:57:21 MST
Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE? I
suspect you will find that a lot of people/programs process it on the
assumption that it's just one line per CPU per host and contains just the
hostname.

Things like Fluent, etc.
Comment 11 Simon Toth 2010-11-07 23:48:15 MST
(In reply to comment #10)
> Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE, I
> suspect you will find a lot of people/programs process it on the assumption
> that it's just one line per CPU per host and just has the hostname in it.
> 
> Things like Fluent, etc..

I'm not sure what is planned by the core developers, but my patch only extends
the syntax, while staying on one line:

node.hostname property1 property2 resources_total.mem=4G resources_total.gpu=2
resources_total.gpu_type=tesla

I guess the official patch will work in a similar fashion.
Comment 12 David Beer 2010-11-08 08:59:08 MST
(In reply to comment #10)
> Can I sound a note of caution about modifying the syntax of $PBS_NODEFILE, I
> suspect you will find a lot of people/programs process it on the assumption
> that it's just one line per CPU per host and just has the hostname in it.
> 
> Things like Fluent, etc..

Thanks for the heads-up, Chris. I was thinking that since most
scripts/programs process $PBS_NODEFILE to decide where to run, this was the
logical place to put information about which GPUs to run on. It seems to me
that any program that parses $PBS_NODEFILE would need to be updated in order
to place itself on a GPU. Please also note that if GPUs are not requested, no
change would be made to $PBS_NODEFILE.

That being said, if desired, we could create something else to store the GPUs
in - perhaps the $PBS_GPUFILE or something. Would that be better?

David
Comment 13 David Beer 2010-11-08 10:19:17 MST
> That being said, if desired, we could create something else to store the GPUs
> in - perhaps the $PBS_GPUFILE or something. Would that be better?
> 

After considering things, we decided to answer our own question and go with
storing things in the $PBS_GPUFILE. Any objections?

David
Comment 14 Simon Toth 2010-11-08 11:44:47 MST
(In reply to comment #13)
> > That being said, if desired, we could create something else to store the GPUs
> > in - perhaps the $PBS_GPUFILE or something. Would that be better?
> > 
> 
> After considering things, we decided to answer our own question and go with
> storing things in the $PBS_GPUFILE. Any objections?

Oh well, I would, but my patch still hasn't been accepted, so I will have to
maintain my own resource semantics anyway.
Comment 15 Chris Samuel 2010-11-08 22:50:41 MST
Just in case there has been any confusion, the $PBS_NODEFILE I was referring
to was the one generated by the mother superior pbs_mom for each job, which
traditionally just consists of one hostname per line per vnode allocated to
the job, not the file that defines nodes in pbs_server.
Comment 16 Simon Toth 2010-11-10 02:54:57 MST
(In reply to comment #15)
> Just in case there has been any confusion, the $PBS_NODEFILE I was referring to
> was the one generated by the mother superior pbs_mom for each job and
> traditionally just consisted of one hostname per line per vnode allocated to
> the job.  Not the file that defines nodes in the pbs_server.

Oh, yes, I did misunderstand.
Comment 17 David Beer 2010-11-10 09:02:40 MST
(In reply to comment #15)
> Just in case there has been any confusion, the $PBS_NODEFILE I was referring to
> was the one generated by the mother superior pbs_mom for each job and
> traditionally just consisted of one hostname per line per vnode allocated to
> the job.  Not the file that defines nodes in the pbs_server.

I understood. We switched to putting the GPU assignments in $PBS_GPUFILE and
added the environment variable $PBS_GPUFILE to the job's environment. The
environment variable is always there, but the file is only created when GPUs
are actually requested.
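So a job script can, for example, guard on the file's existence (just an
illustrative snippet, not something TORQUE mandates):

# $PBS_GPUFILE is always set; the file only exists if gpus were requested
if [ -f "$PBS_GPUFILE" ]; then
    echo "GPU assignments for this job:"
    cat "$PBS_GPUFILE"
fi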

David
Comment 18 Andrew Keen 2010-11-17 17:21:00 MST
Could the mom use cuda_wrapper (http://sourceforge.net/projects/cudawrapper/)
to set up the environment for the user?
Comment 19 Simon Toth 2010-11-18 08:25:07 MST
(In reply to comment #17)
> (In reply to comment #15)
> > Just in case there has been any confusion, the $PBS_NODEFILE I was referring to
> > was the one generated by the mother superior pbs_mom for each job and
> > traditionally just consisted of one hostname per line per vnode allocated to
> > the job.  Not the file that defines nodes in the pbs_server.
> 
> I understood. We switched things to $PBS_GPUFILE, and added the environment
> variable $PBS_GPUFILE to the job's environment. The environment variable is
> always there, but the file is only created when GPUs are actually requested.

Looking at the code in 2.5-fixes, how will the program actually know which
cards are allocated? Will the ids match the devices?
Comment 20 David Beer 2010-11-18 09:13:29 MST
(In reply to comment #18)
> Could the mom use cuda_wrapper (http://sourceforge.net/projects/cudawrapper/)
> to set up the environment for the user?

This is something we are looking into for improvements to the feature. Most
likely we would try to use OpenCL, because we would like to have support for
both AMD and nvidia cards, even though nvidia dominates the market.
Comment 21 David Beer 2010-11-18 09:19:27 MST
> 
> Looking at the code in 2.5-fixes, how will the program actually know which
> cards are allocated? Will the ids match the devices?

Our first-pass idea is to do what TORQUE does with cores (ppn, virtual
processors, however they should be referred to). An admin is allowed to
overload their cores if desired - they can set a 4-core machine to ppn=8 or
anything they like. There is also no guarantee that, if a job is assigned
host/0 (theoretically the 0th core), it will actually execute on the 0th core.

We can imagine that some sites are going to want to overload their gpus, just
as some sites do with cpus, and so our initial approach is to handle gpus
exactly the same way cores are handled by default. It is up to the user to
guarantee that they actually execute on the GPU(s) assigned to their job, by
reading the file $PBS_GPUFILE. Eventually, we will add options to lock GPUs to
their jobs (like cpusets) and to autodetect the number and types of GPUs on
each system, but that is not something TORQUE can handle at this point.
Comment 22 Simon Toth 2010-11-18 10:00:41 MST
(In reply to comment #21)
> > 
> > Looking at the code in 2.5-fixes, how will the program actually know which
> > cards are allocated? Will the ids match the devices?
> 
> Our first pass idea is to do what TORQUE does with cores (ppn, virtual
> processors, however they should be referred to). An admin is allowed to
> overload their cores if desired - they can set a 4 core machine to ppn=8 or
> anything they like. There is also no guarantee that, if they are assigned
> host/0 (theoretically the 0th core) that the job will actually execute on the
> 0th core.
> 
> We can imagine that some site is going to want to overload their gpus, just as
> some sites do with cpus, and so our initial approach is to handle gpus exactly
> the same way cores are handled by default. It is up to the user to guarantee
> that they actually execute on the GPU(s) assigned to their job, by reading the
> file $PBS_GPUFILE. Eventually, we will add options to lock GPUs to their jobs
> (like cpusets) and to autodetect the number and types of GPUs on each system.
> This is something we will eventually do but not something TORQUE can handle at
> this point.

This is actually a different issue. With GPU APIs, what you need to do is
specify a card upon initialization, so the job kind of needs to know which
gpus are allocated. What I'm asking about is the mapping: if there isn't one,
how will the user know which cards are OK to use?
Comment 23 David Beer 2010-11-18 10:04:50 MST
(In reply to comment #22)
> (In reply to comment #21)
> > > 
> > > Looking at the code in 2.5-fixes, how will the program actually know which
> > > cards are allocated? Will the ids match the devices?
> > 
> > Our first pass idea is to do what TORQUE does with cores (ppn, virtual
> > processors, however they should be referred to). An admin is allowed to
> > overload their cores if desired - they can set a 4 core machine to ppn=8 or
> > anything they like. There is also no guarantee that, if they are assigned
> > host/0 (theoretically the 0th core) that the job will actually execute on the
> > 0th core.
> > 
> > We can imagine that some site is going to want to overload their gpus, just as
> > some sites do with cpus, and so our initial approach is to handle gpus exactly
> > the same way cores are handled by default. It is up to the user to guarantee
> > that they actually execute on the GPU(s) assigned to their job, by reading the
> > file $PBS_GPUFILE. Eventually, we will add options to lock GPUs to their jobs
> > (like cpusets) and to autodetect the number and types of GPUs on each system.
> > This is something we will eventually do but not something TORQUE can handle at
> > this point.
> 
> This is actually a different issue. With GPU APIs what you need to do is
> specify a card upon initialization, therefore the job kind of needs to know
> which gpus are allocated. What I'm asking is the mapping. Because if there
> isn't, how will the user know what cards are OK to use?

Each job has its own $PBS_GPUFILE (just like the $PBS_NODEFILE for each job).
The job script should parse this file to find its gpus. It contains lines in
the format <hostname>-gpu<index>, and the file is written in this format for
every job.
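As an untested sketch of what that parsing could look like in a job script
(assuming the file uses short hostnames; adjust if it contains FQDNs):

# hypothetical example; expects lines like "node01-gpu0"
host=$(hostname)
indices=$(sed -n "s/^${host}-gpu//p" "$PBS_GPUFILE" | tr '\n' ',' | sed 's/,$//')
export CUDA_VISIBLE_DEVICES="$indices"
echo "Using GPUs $CUDA_VISIBLE_DEVICES on $host"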