[torquedev] [Bug 95] Support for GPUs

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Thu Nov 4 21:11:12 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=95

Paul McIntosh <paulmc at vpac.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |paulmc at vpac.org

--- Comment #9 from Paul McIntosh <paulmc at vpac.org> 2010-11-04 21:11:12 MDT ---
Hi All,

Someone pointed me to this ticket and I have some more info that I will dump as
it may help (or at least prevent some headaches).

With the current CUDA 3.1 drivers, locking the permissions on one device locks
all devices, so you can only give a user all of the GPU resources or none of
them.

This has been changed in CUDA 3.2, as NVIDIA also thinks it is a good idea to
be able to lock individual GPUs to users. Below is a copy of some investigation
work I did into the new feature; more info will be available at
http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html

-=snip=-
GPU to process locking (Vizworld "allow admins to lock processes to certain
GPU’s")

Good news: it may now be possible to lock GPUs to user jobs via some chmod
scripting in the queue system.

I don't know where Vizworld got that statement, as it is not mentioned anywhere
in the release notes. However...

There is talk of CUDA 3.2 handling user access restrictions better via
permissions on /dev/nvidiactl:

"- a user can now access a subset of GPUs by having RW privileges to
/dev/nvidiactl and RW privileges to only a subset of the /dev/nvidia[0...n]
rather than having the CUDA driver throw an error if you can't access any of
the nodes; devices that a user doesn't have permissions to will not be visible
to the app (think CUDA_VISIBLE_DEVICES version 2.0)"
http://forums.nvidia.com/lofiversion/index.php?t180547.html
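
As a rough illustration of the chmod idea quoted above, here is a minimal Python
sketch of what a queue prologue might do. It is not taken from TORQUE or from
NVIDIA: the function names, the choice of chown plus mode 0600, and the
assumption that the script runs as root on a node with /dev/nvidiactl and
/dev/nvidia0..n are all hypothetical.

```python
import os

CTL_NODE = "/dev/nvidiactl"  # control node every CUDA user needs RW access to


def device_node(index):
    """Path of the device file for GPU `index` (assumed naming)."""
    return "/dev/nvidia%d" % index


def plan_permissions(gpu_count, granted, job_uid):
    """Return {path: (owner_uid, mode)} for the control node and each GPU.

    GPUs in `granted` become owned by the job user; all others are handed
    back to root. Mode 0600 means only the owner can open the device.
    """
    plan = {CTL_NODE: (0, 0o666)}
    for i in range(gpu_count):
        owner = job_uid if i in granted else 0
        plan[device_node(i)] = (owner, 0o600)
    return plan


def apply_permissions(plan):
    """Apply a plan to the real device files (needs root)."""
    for path, (uid, mode) in plan.items():
        os.chown(path, uid, -1)  # -1 leaves the group unchanged
        os.chmod(path, mode)


if __name__ == "__main__":
    # Hypothetical job: uid 1000 gets GPU 0 on a two-GPU node.
    for path, (uid, mode) in sorted(plan_permissions(2, {0}, 1000).items()):
        print(path, uid, oct(mode))
```

The plan is kept separate from the part that touches /dev so it can be checked
without root. An epilogue would presumably call the same functions with an
empty `granted` set to take the devices back.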

Looking at CUDA_VISIBLE_DEVICES in 3.1, we see that the environment variable
can restrict which devices a user sees:
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/cudatoolkit_release_notes_linux.txt
-=snip=-
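
For the CUDA_VISIBLE_DEVICES route mentioned in the notes above, a small Python
sketch of launching a job with the variable set follows. The helper names are
made up for illustration. The variable takes a comma-separated list of device
indices; since a user can override it in their own environment, it is a
convenience rather than an enforcement mechanism, which is why the permission
change in 3.2 matters.

```python
import os
import subprocess
import sys


def env_for_gpus(indices, base_env=None):
    """Copy an environment and set CUDA_VISIBLE_DEVICES to the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env


def run_with_gpus(cmd, indices):
    """Run `cmd` so that CUDA only enumerates the listed devices."""
    return subprocess.run(cmd, env=env_for_gpus(indices)).returncode


if __name__ == "__main__":
    # Stand-in for a real CUDA job: a child that echoes the variable.
    child = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    run_with_gpus(child, [1])
```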

I currently have a dual Tesla C1060 node set up with CUDA 3.2 RC. If you want
me to test anything, let me know.

Cheers,

Paul
paulmc at vpac.org
