[torqueusers] TORQUE GPU support

Ken Nielson knielson at adaptivecomputing.com
Fri Sep 6 15:56:12 MDT 2013


On Fri, Sep 6, 2013 at 3:20 PM, Dave Ulrick <d-ulrick at comcast.net> wrote:

> Hi,
>
> We recently upgraded to TORQUE 4.2.4.1 so I took the opportunity to enable
> nVidia GPU support. It provides some GPU statistics in the 'pbsnodes'
> output that my users might find useful, but I'm having trouble figuring
> out how to support it in a reliable and frugal manner.
>
> My issue has to do with how the GPU code gets the stats by calling
> 'nvidia-smi -q' every few seconds. This is reasonably efficient if the
> nVidia driver is using persistence mode. If it isn't, the nVidia driver is
> loaded when nvidia-smi initializes the GPUs, then unloaded when nvidia-smi
> exits. On our compute nodes it takes 0.2-0.5 wall-clock seconds for the driver
> to load. This keeps an otherwise idle compute node busy enough that load
> average can go above 0.20. Persistence mode eliminates the
> driver load/unload overhead, but we're seeing that some user applications
> can fail when they try to initialize the GPUs if the previous application
> didn't properly release them. Such applications work fine when persistence
> mode is disabled.
>
> At this point our options seem to be:
>
> 1. TORQUE GPU support enabled with nVidia persistence mode on: GPU stats
> and minimum driver overhead but questionable application reliability.
>
> 2. TORQUE GPU support disabled with nVidia persistence mode off: no GPU
> stats with increased driver overhead but good application reliability.
>
> 3. TORQUE GPU support enabled with nVidia persistence mode off: GPU stats
> and good app reliability but high driver overhead.
>
> If you're using TORQUE GPU support, have you noticed the issue I'm seeing?
> If so, how have you chosen to cope? If you've enabled nVidia persistence
> mode, how are your users managing to run GPU apps that can reliably
> initialize the GPUs even if the previous GPU app failed to release them
> properly? Our GPU users are using CUDA 5.0.
>
> Thanks,
> Dave
> --
> Dave Ulrick
> d-ulrick at comcast.net
>

Dave,

The following explains how to get rid of the nvidia-smi call and have TORQUE
call the NVML API instead.

TORQUE configuration

There are three configuration (./configure) options available for use with
Nvidia GPGPUs:

   - --enable-nvidia-gpus
   - --with-nvml-lib=DIR
   - --with-nvml-include=DIR

--enable-nvidia-gpus is used to enable the new features for the Nvidia
GPGPUs. By default, each pbs_mom uses the nvidia-smi command to interface
with the Nvidia GPUs.

./configure --enable-nvidia-gpus
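
In this default mode the MOM simply shells out to the same query Dave mentions
above; running it by hand on a MOM node is a quick way to confirm the driver
and tools are working and to see the fields that end up in pbsnodes (the
output fields vary by driver version):

nvidia-smi -q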

To use the NVML (NVIDIA Management Library) API instead of nvidia-smi,
configure TORQUE using --with-nvml-lib=DIR and --with-nvml-include=DIR.
These options specify the location of the libnvidia-ml library and the
location of the nvml.h include file.

./configure --enable-nvidia-gpus --with-nvml-lib=/usr/lib \
  --with-nvml-include=/usr/local/cuda/Tools/NVML
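
Before running configure it is worth confirming that the library and header
really are where you point these options; the paths below are just the ones
from the example above and will differ between driver and CUDA toolkit
installations:

ldconfig -p | grep nvidia-ml
ls /usr/lib/libnvidia-ml.*
ls /usr/local/cuda/Tools/NVML/nvml.h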

By default, when TORQUE is configured with --enable-nvidia-gpus the
$TORQUE_HOME/server_priv/nodes file is automatically updated with the correct
GPU count for each MOM node.
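
To verify, each entry in the nodes file should pick up a gpus= count next to
np=. The host name and counts here are only placeholders:

node01 np=16 gpus=2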

GPU modes for NVIDIA 260.x driver

   - 0 – Default - Shared mode available for multiple processes
   - 1 – Exclusive - Only one COMPUTE thread is allowed to run on the GPU
   - 2 – Prohibited - No COMPUTE contexts are allowed to run on the GPU

GPU modes for NVIDIA 270.x driver

   - 0 – Default - Shared mode available for multiple processes
   - 1 – Exclusive Thread - Only one COMPUTE thread is allowed to run on
   the GPU (v260 exclusive)
   - 2 – Prohibited - No COMPUTE contexts are allowed to run on the GPU
   - 3 – Exclusive Process - Only one COMPUTE process is allowed to run on
   the GPU
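
These modes, like the persistence setting discussed earlier in the thread, are
driver-side settings, so they are normally changed with nvidia-smi rather than
through TORQUE. A rough sketch, assuming GPU index 0 and a root shell; the
numeric values match the lists above, and flag spellings can vary a little
between driver releases:

nvidia-smi -i 0 -c 3            # compute mode: Exclusive Process (270.x numbering)
nvidia-smi -i 0 -pm 1           # enable persistence mode
nvidia-smi -i 0 -q -d COMPUTE   # show the current compute mode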



-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com

