[torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead

David Beer dbeer at adaptivecomputing.com
Fri Feb 17 13:10:41 MST 2012


Doug,

I have created a ticket for our documentation team to note that the TDK is
where nvml.h can be found.

Thank you also for the patch. I believe there is some more work that
needs to be done beyond just this change, but we will look to get that
done very soon. I think it would be ideal to allow people to use the same
binary for both GPU-enabled and non-GPU-enabled nodes.
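
One way to get there might be to load NVML at runtime instead of linking
it in at build time, so a mom built with GPU support still starts cleanly
on nodes where the library is absent. A minimal sketch (the helper name is
illustrative, not actual TORQUE code; build with -ldl):

  #include <dlfcn.h>
  #include <stddef.h>

  typedef int (*nvml_init_fn)(void);

  /* Returns 1 if libnvidia-ml could be loaded and initialized,
   * 0 otherwise (library missing, driver not loaded, etc.), letting
   * the same binary fall back to non-GPU operation. */
  static int try_init_nvml(void)
    {
    void         *lib = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
    nvml_init_fn  init;

    if (lib == NULL)
      return(0);

    init = (nvml_init_fn)dlsym(lib, "nvmlInit");

    if ((init == NULL) || (init() != 0))  /* 0 == NVML_SUCCESS */
      {
      dlclose(lib);
      return(0);
      }

    return(1);
    }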

David

On Thu, Feb 16, 2012 at 1:49 PM, Doug Johnson <djohnson at osc.edu> wrote:

> Axel, thanks for the clarification.  David, can you update the
> documentation to clarify that the Tesla Deployment Kit is needed
> for nvml.h?  The TDK is not linked from the normal CUDA download
> pages and is a bit obscure.
>
> However, when this option is enabled (at least in torque-2.5.10),
> pbs_mom will immediately exit if the node does not have a gpu.
> Clusters that have a mix of GPU and non-GPU nodes are common.  Could
> we do something like the following instead?
>
> --- mom_server.c~       2012-01-12 16:34:39.000000000 -0500
> +++ mom_server.c        2012-02-16 14:51:17.480860518 -0500
> @@ -1255,7 +1255,7 @@
>
>   rc = nvmlInit();
>
> -  if (rc == NVML_SUCCESS)
> +  if (rc == NVML_SUCCESS || rc == NVML_ERROR_DRIVER_NOT_LOADED)
>     return (TRUE);
>
>   log_nvml_error (rc, NULL, id);
>
> This would allow systems without GPUs to start the same mom as the GPU
> nodes.  Ideally the API would also have an error such as
> NVML_ERROR_NO_DEVICE that would be returned if no NVIDIA devices
> existed in the system (check for PCI devices; don't rely on driver
> initialization failure, as that's ambiguous).
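>
> As a rough sketch, a PCI-based check could look something like this
> (assuming the Linux sysfs layout; the function name is just
> illustrative, not existing TORQUE code):
>
>   #include <stdio.h>
>   #include <string.h>
>   #include <dirent.h>
>
>   /* Return 1 if any PCI device with NVIDIA's vendor ID (0x10de) is
>    * present, without depending on the driver being loaded. */
>   static int nvidia_pci_device_present(void)
>     {
>     DIR           *dir = opendir("/sys/bus/pci/devices");
>     struct dirent *ent;
>     int            found = 0;
>
>     if (dir == NULL)
>       return(0);
>
>     while ((found == 0) && ((ent = readdir(dir)) != NULL))
>       {
>       char  path[300];
>       char  vendor[16];
>       FILE *fp;
>
>       if (ent->d_name[0] == '.')
>         continue;
>
>       snprintf(path, sizeof(path),
>                "/sys/bus/pci/devices/%s/vendor", ent->d_name);
>
>       if ((fp = fopen(path, "r")) == NULL)
>         continue;
>
>       if ((fgets(vendor, sizeof(vendor), fp) != NULL) &&
>           (strncmp(vendor, "0x10de", 6) == 0))
>         found = 1;
>
>       fclose(fp);
>       }
>
>     closedir(dir);
>
>     return(found);
>     }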
>
> Doug
>
>
> At Wed, 15 Feb 2012 18:56:36 -0500,
> Axel Kohlmeyer wrote:
> >
> > On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson <djohnson at osc.edu> wrote:
> > > Hi David,
> > >
> > > I was going to send a separate email about '--with-nvml-include' once
> > > I had more time to look at the problem.  It seems that nvml.h no
> > > longer exists in the newer versions of the CUDA SDK.  We have version
> >
> > http://developer.nvidia.com/nvidia-management-library-NVML
> >
> > axel.
> >
> > > 4.1.28 of both the gpucomputingsdk and cudatoolkit; there is no nvml.h,
> > > and enabling this option in torque results in a build failure.  I
> > > haven't had a chance to take a look at older versions or the release
> > > notes for descriptions of when this changed.
> > >
> > > Is it safe to assume that if we were able to use this code, a context
> > > to the cards would be kept open by the mom?
> > >
> > > Doug
> > >
> > > At Wed, 15 Feb 2012 16:22:09 -0700,
> > > David Beer wrote:
> > >>
> > >> Doug,
> > >>
> > >> Have you tried using the --with-nvml-include=<path> option in
> > >> configure?  This has pbs_mom use the nvidia API for these calls, and
> > >> should speed things up a bit.  The path should be the path to the
> > >> nvml.h file and is usually:
> > >>
> > >>   /usr/local/cuda/CUDAToolsSDK/NVML/
> > >>
> > >> David
> > >>
> > >> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson <djohnson at osc.edu> wrote:
> > >>
> > >>     Hi,
> > >>
> > >>     Has anyone noticed the overhead when enabling GPU support in
> > >>     torque?  The nvidia-smi process requires about 4 cpu seconds for
> > >>     each invocation.  When executing a non-GPU code that uses all the
> > >>     cores, this results in a bit of oversubscription of the cores.
> > >>     Since nvidia-smi is executed every 30 seconds to collect card
> > >>     state, this results in a measurable decrease in performance.
> > >>
> > >>     As a workaround I've enabled 'persistence mode' for the card.
> > >>     When not in use, the card is apparently not initialized.  With
> > >>     persistence mode enabled, the cpu time to execute the command is
> > >>     reduced to ~0.02 seconds.  This will also help with the execution
> > >>     time of short kernels, as the card will be ready to go.
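> > >>
> > >>     (For reference: persistence mode can be enabled as root with
> > >>     'nvidia-smi -pm 1', or programmatically through NVML.  A minimal
> > >>     sketch of the NVML route, assuming the headers from the TDK and
> > >>     linking against -lnvidia-ml:)
> > >>
> > >>     #include <nvml.h>
> > >>
> > >>     int main(void)
> > >>       {
> > >>       unsigned int i, count;
> > >>
> > >>       /* NVML must be initialized before any other call */
> > >>       if (nvmlInit() != NVML_SUCCESS)
> > >>         return(1);
> > >>
> > >>       if (nvmlDeviceGetCount(&count) == NVML_SUCCESS)
> > >>         {
> > >>         for (i = 0; i < count; i++)
> > >>           {
> > >>           nvmlDevice_t dev;
> > >>
> > >>           /* setting persistence mode requires root privileges */
> > >>           if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS)
> > >>             nvmlDeviceSetPersistenceMode(dev, NVML_FEATURE_ENABLED);
> > >>           }
> > >>         }
> > >>
> > >>       nvmlShutdown();
> > >>       return(0);
> > >>       }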
> > >>
> > >>     Do other people run with persistence mode enabled?  Are there any
> > >>     downsides?
> > >>
> > >>     Doug
> > >>
> > >>     PS. I think if X were running, this would not be an issue.
> > >>
> > >> --
> > >> David Beer | Software Engineer
> > >> Adaptive Computing
> > >>
> >
> > --
> > Dr. Axel Kohlmeyer    akohlmey at gmail.com
> > http://sites.google.com/site/akohlmey/
> >
> > Institute for Computational Molecular Science
> > Temple University, Philadelphia PA, USA.
>



-- 
David Beer | Software Engineer
Adaptive Computing