[torquedev] [Bug 95] Support for GPUs

bugzilla-daemon at supercluster.org
Thu Nov 18 10:04:50 MST 2010


--- Comment #23 from dbeer at adaptivecomputing.com 2010-11-18 10:04:50 MST ---
(In reply to comment #22)
> (In reply to comment #21)
> > > 
> > > Looking at the code in 2.5-fixes, how will the program actually know which
> > > cards are allocated? Will the ids match the devices?
> > 
> > Our first pass idea is to do what TORQUE does with cores (ppn, virtual
> > processors, however they should be referred to). An admin is allowed to
> > overload their cores if desired - they can set a 4 core machine to ppn=8 or
> > anything they like. There is also no guarantee that, if they are assigned
> > host/0 (theoretically the 0th core) that the job will actually execute on the
> > 0th core.
> > 
> > We can imagine that some site is going to want to overload their gpus, just as
> > some sites do with cpus, and so our initial approach is to handle gpus exactly
> > the same way cores are handled by default. It is up to the user to guarantee
> > that they actually execute on the GPU(s) assigned to their job, by reading the
> > file $PBS_GPUFILE. Eventually, we will add options to lock GPUs to their jobs
> > (like cpusets) and to autodetect the number and types of GPUs on each system,
> > but TORQUE cannot handle this at this point.
> This is actually a different issue. GPU APIs require you to specify a card at
> initialization, so the job needs to know which gpus are allocated. What I'm
> asking about is the mapping between the allocated ids and the devices. If there
> isn't one, how will the user know which cards are OK to use?

Each job gets its own $PBS_GPUFILE (just as each job gets a $PBS_NODEFILE).
The job script should parse this file to find its gpus. It contains lines in the
form <hostname>-gpu<index>. This file is meant to have this specification for each
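
As a rough illustration of the parsing described above, here is a minimal
Python sketch. It assumes only what this message states: one entry per line in
the form <hostname>-gpu<index>. The function names parse_gpufile and
local_gpu_indices are hypothetical and not part of TORQUE, and whether the
index matches the device ordinal expected by a given GPU API is the open
question in this thread, so treat the result as a hint rather than a guarantee.

```python
import os
import socket


def parse_gpufile(text):
    """Return (hostname, gpu_index) pairs from $PBS_GPUFILE-style text.

    Assumes one "<hostname>-gpu<index>" entry per line, as described in
    this thread; any other layout raises ValueError.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the last "-gpu" so hostnames containing "-gpu" still work.
        host, sep, index = line.rpartition("-gpu")
        if not sep or not host or not index.isdecimal():
            raise ValueError(f"unexpected GPU file line: {line!r}")
        entries.append((host, int(index)))
    return entries


def local_gpu_indices(entries, hostname):
    """Return the sorted GPU indices listed for the given hostname."""
    return sorted(idx for host, idx in entries if host == hostname)


if __name__ == "__main__":
    path = os.environ.get("PBS_GPUFILE")
    if path:
        with open(path) as fh:
            entries = parse_gpufile(fh.read())
        # Assumption: the file uses the same name socket.gethostname() returns.
        print(local_gpu_indices(entries, socket.gethostname()))
```

If the file records short hostnames while the node reports a fully qualified
name (or the reverse), the comparison in local_gpu_indices would need to
normalize both sides first.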
