[torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Wed Feb 15 16:56:36 MST 2012


On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson <djohnson at osc.edu> wrote:
> Hi David,
>
> I was going to send a separate email about '--with-nvml-include' once
> I had more time to look at the problem.  It seems that nvml.h no
> longer exists in the newer versions of the CUDA SDK.  We have version

http://developer.nvidia.com/nvidia-management-library-NVML

axel.

> 4.1.28 of both the gpucomputingsdk and cudatoolkit; there is no nvml.h,
> and enabling this option in torque results in a build failure.  I
> haven't had a chance to look at older versions or the release notes to
> see when this changed.
>
> Is it safe to assume that if we were able to use this code, a context
> to the cards would be kept open by the mom?
>
> Doug
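
To make the "open context" question concrete: with NVML, the library/driver context opened by nvmlInit() lives until nvmlShutdown(), so a mom-style poller that initializes once and reuses its handles never pays the per-invocation startup cost that a forked nvidia-smi does. A minimal sketch of that pattern follows; it is illustrative only (not TORQUE source), and the file name and build line are assumptions.

    /* Illustrative sketch, not TORQUE source: poll GPU state through one
     * persistent NVML context instead of forking nvidia-smi every cycle.
     * Assumed build line (adjust paths to your install):
     *   gcc nvml_poll.c -I/usr/local/cuda/CUDAToolsSDK/NVML -lnvidia-ml -o nvml_poll
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <nvml.h>

    int main(void)
    {
        unsigned int i, count;
        int cycle;

        if (nvmlInit() != NVML_SUCCESS) {       /* one context for the whole run */
            fprintf(stderr, "nvmlInit failed\n");
            return 1;
        }
        if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
            count = 0;

        for (cycle = 0; cycle < 3; ++cycle) {   /* a real daemon would loop forever */
            for (i = 0; i < count; ++i) {
                nvmlDevice_t dev;
                nvmlUtilization_t util;

                if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
                    continue;
                if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                    printf("gpu %u: %u%% gpu, %u%% mem\n", i, util.gpu, util.memory);
            }
            sleep(30);                          /* same 30 s polling interval */
        }

        nvmlShutdown();                         /* context released only at exit */
        return 0;
    }
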
>
> At Wed, 15 Feb 2012 16:22:09 -0700,
> David Beer wrote:
>>
>> Doug,
>>
>> Have you tried the --with-nvml-include=<path> option in configure? This makes pbs_mom use the
>> NVIDIA NVML API for these calls rather than forking nvidia-smi, and should speed things up a bit.
>> The path should point to the directory containing nvml.h, which is usually:
>> /usr/local/cuda/CUDAToolsSDK/NVML/
>>
>> David
>>
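
For reference, once a program is built against nvml.h, a status sample is just a handful of in-process library calls, with no fork/exec. A stand-alone sketch (illustrative only; the compile line assumes the include path above):

    /* Minimal NVML status query, illustrative only.  Assumed build line:
     *   gcc nvml_query.c -I/usr/local/cuda/CUDAToolsSDK/NVML -lnvidia-ml -o nvml_query
     */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        unsigned int i, count, temp;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlDevice_t dev;
        nvmlReturn_t rc;

        rc = nvmlInit();
        if (rc != NVML_SUCCESS) {
            fprintf(stderr, "nvmlInit: %s\n", nvmlErrorString(rc));
            return 1;
        }
        if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
            count = 0;

        for (i = 0; i < count; ++i) {
            if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
                continue;
            nvmlDeviceGetName(dev, name, sizeof(name));
            if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
                printf("gpu %u: %s, %u C\n", i, name, temp);
        }

        nvmlShutdown();
        return 0;
    }
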
>> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson <djohnson at osc.edu> wrote:
>>
>>     Hi,
>>
>>     Has anyone noticed the overhead of enabling GPU support in torque?
>>     Each nvidia-smi invocation takes about 4 CPU seconds.  When a
>>     non-GPU code is using all the cores, this oversubscribes the cores,
>>     and since pbs_mom runs nvidia-smi every 30 seconds to collect card
>>     state (roughly 13% of one core on average), the result is a
>>     measurable decrease in performance.
>>
>>     As a workaround I've enabled 'persistence mode' on the cards.  Without
>>     it, the driver apparently deinitializes a card when it is not in use,
>>     so every nvidia-smi invocation pays the initialization cost.  With
>>     persistence mode enabled, the CPU time per invocation drops to ~0.02
>>     seconds.  It should also help the startup time of short kernels, as
>>     the card is already initialized.
>>
>>     Do other people run with persistence mode enabled?  Are there any
>>     downsides?
>>
>>     Doug
>>
>>     PS. I think if X were running, this would not be an issue.
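
On the persistence-mode workaround: NVML also exposes persistence mode directly, so the same check could be done in-process rather than with 'nvidia-smi -pm 1' at boot. A sketch (illustrative only; the set call needs root):

    /* Illustrative only: query persistence mode via NVML and enable it where
     * it is off.  Setting it requires root, same as 'nvidia-smi -pm 1'.
     */
    #include <stdio.h>
    #include <nvml.h>

    int main(void)
    {
        unsigned int i, count;
        nvmlDevice_t dev;
        nvmlEnableState_t mode;

        if (nvmlInit() != NVML_SUCCESS)
            return 1;
        if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
            count = 0;

        for (i = 0; i < count; ++i) {
            if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
                continue;
            if (nvmlDeviceGetPersistenceMode(dev, &mode) == NVML_SUCCESS &&
                mode != NVML_FEATURE_ENABLED) {
                nvmlReturn_t rc = nvmlDeviceSetPersistenceMode(dev, NVML_FEATURE_ENABLED);
                printf("gpu %u: persistence %s\n", i,
                       rc == NVML_SUCCESS ? "enabled" : "not changed (need root?)");
            }
        }

        nvmlShutdown();
        return 0;
    }
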
>>
>> --
>> David Beer | Software Engineer
>> Adaptive Computing
>>
>>



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

