[torquedev] [Bug 95] New: Support for GPUs

bugzilla-daemon at supercluster.org
Thu Nov 4 07:08:27 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=95

           Summary: Support for GPUs
           Product: TORQUE
           Version: 3.0.0-alpha
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: SimonT at mail.muni.cz
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


It seems that all that is needed for exclusive GPU access is changing the
ownership of the graphics card device (for NVIDIA: /dev/nvidiaX).

http://stackoverflow.com/questions/4077790/limiting-access-to-resources-for-cuda-and-opencl
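As a minimal sketch of that idea (not Torque code), the ownership change
could look roughly like the following. The function names are hypothetical.
The sketch also assumes that the device mode has to be owner-only (0600) for
the ownership change to actually exclude other users; this is not stated in
the linked answer and would need checking against the driver setup.

```python
import os
import stat


def gpu_device_path(index):
    """Return the device node for an NVIDIA GPU index, e.g. 0 -> /dev/nvidia0."""
    if index < 0:
        raise ValueError("GPU index must be non-negative")
    return "/dev/nvidia%d" % index


def grant_exclusive(index, uid, gid, chown=os.chown, chmod=os.chmod):
    """Hand one GPU device node to a single user.

    chown and chmod are parameters so the logic can be exercised
    without root privileges or real GPU hardware.
    """
    path = gpu_device_path(index)
    chown(path, uid, gid)
    # Assumption: owner-only read/write is what makes the device exclusive.
    chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

Calling grant_exclusive(0, uid, gid) as root would hand /dev/nvidia0 to the
given user; giving the device back afterwards is the same call with root's
uid and gid.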

If this is true, then we can support the homogeneous use case of GPUs very
simply (different cards in one machine require much more server and node
logic).

Counted resources are supported by Bug 67, which ensures correct assignment of
jobs requesting GPUs.

As for the node part, I see two possible approaches:

1) Modifying the Linux mom_mach.c file (presumably in mom_set_limits) to
correctly find and chown the corresponding GPUs. I'm not sure about the cleanup
part (maybe kill_task).

2) Doing the GPU assignment/cleanup in the prologue/epilogue. The node code
would only set environment variables such as GPU_COUNT and GPU_LIST (or
similar).

The second approach might be preferable because it would allow users of Torque
to easily write or modify their own implementations of the GPU assignment,
which would also make it easy to port to a different GPU API.
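
To illustrate the second approach, here is a rough sketch of what a
site-provided prologue/epilogue pair could do. Everything in it is an
assumption rather than existing Torque behaviour: GPU_LIST is taken to be a
comma-separated list of card indices, GPU_COUNT the number of cards, and the
function names are hypothetical. How the job owner's uid/gid reach the script
is left to the caller.

```python
import os


def parse_gpu_list(value):
    """Parse a comma-separated list of GPU indices, e.g. "0,2" -> [0, 2]."""
    return [int(item) for item in value.split(",") if item.strip()]


def device_paths(env):
    """Map GPU_LIST (assumed format) to /dev/nvidiaX device nodes."""
    indices = parse_gpu_list(env.get("GPU_LIST", ""))
    count = env.get("GPU_COUNT")
    # Refuse to touch any device if the two variables disagree.
    if count is not None and int(count) != len(indices):
        raise ValueError("GPU_COUNT does not match GPU_LIST")
    return ["/dev/nvidia%d" % i for i in indices]


def prologue(env, uid, gid, chown=os.chown):
    """Give the job owner the GPUs assigned to the job."""
    for path in device_paths(env):
        chown(path, uid, gid)


def epilogue(env, chown=os.chown):
    """Return the assigned GPUs to root (uid 0, gid 0) when the job ends."""
    for path in device_paths(env):
        chown(path, 0, 0)
```

A site using a different GPU API would only need to replace device_paths
(and possibly the chown step), which is the portability argument made above.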
