[torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
David Beer
dbeer at adaptivecomputing.com
Fri Feb 17 13:10:41 MST 2012
Doug,
I have created a ticket for our documentation team to note that the TDK is
where nvml.h can be found.
We also thank you for the patch. I believe more work is needed beyond
just this change, but we will look to get it done very soon. I think it
would be ideal to allow people to use the same binary for both
GPU-enabled and non-GPU nodes.
David
On Thu, Feb 16, 2012 at 1:49 PM, Doug Johnson <djohnson at osc.edu> wrote:
> Axel, thanks for the clarification. David, can you update the
> documentation to clarify that the Tesla Deployment Kit is needed
> for nvml.h? The TDK is not linked to from the normal CUDA download
> pages, and is a bit obscure.
>
> However, when this option is enabled (at least in torque-2.5.10),
> pbs_mom will immediately exit if the node does not have a gpu.
> Clusters that have a mix of GPU and non-GPU nodes are common. Could
> we do something like the following instead?
>
> --- mom_server.c~ 2012-01-12 16:34:39.000000000 -0500
> +++ mom_server.c 2012-02-16 14:51:17.480860518 -0500
> @@ -1255,7 +1255,7 @@
>
> rc = nvmlInit();
>
> - if (rc == NVML_SUCCESS)
> + if (rc == NVML_SUCCESS || rc == NVML_ERROR_DRIVER_NOT_LOADED)
> return (TRUE);
>
> log_nvml_error (rc, NULL, id);
>
> This would allow systems without GPUs to start the same mom as the GPU
> nodes. Ideally the API would also have an error such as
> NVML_ERROR_NO_DEVICE that would be returned if no nvidia devices
> existed in the system (check for pci devices, don't rely on driver
> initialization failure as that's ambiguous.)
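
A minimal sketch of the idea in the patch above, separating "no driver, so probably no GPU" from a real initialization failure so that the same mom binary can start on both kinds of node. This is not Torque's code. The RC_* and GPU_* names and both functions are hypothetical stand-ins, used so that the example compiles without nvml.h; a real build would test the nvmlReturn_t value returned by nvmlInit() against NVML_SUCCESS and NVML_ERROR_DRIVER_NOT_LOADED.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for NVML return codes, so the sketch compiles
 * without nvml.h. They are not the real NVML constants. */
typedef enum {
    RC_SUCCESS,            /* stands in for NVML_SUCCESS */
    RC_DRIVER_NOT_LOADED,  /* stands in for NVML_ERROR_DRIVER_NOT_LOADED */
    RC_OTHER_ERROR         /* any other initialization failure */
} gpu_rc_t;

/* What the mom could conclude about the node after trying to initialize. */
typedef enum {
    GPU_AVAILABLE,    /* library initialized, GPUs can be queried */
    GPU_ABSENT,       /* no driver loaded, treat the node as non-GPU */
    GPU_INIT_FAILED   /* unexpected error, should be logged */
} gpu_state_t;

/* Map an initialization return code to a node state. */
gpu_state_t classify_gpu_init(gpu_rc_t rc)
{
    switch (rc) {
    case RC_SUCCESS:
        return GPU_AVAILABLE;
    case RC_DRIVER_NOT_LOADED:
        return GPU_ABSENT;
    default:
        return GPU_INIT_FAILED;
    }
}

/* The mom keeps running unless initialization failed unexpectedly. */
bool mom_may_start(gpu_rc_t rc)
{
    return classify_gpu_init(rc) != GPU_INIT_FAILED;
}
```

Keeping GPU_ABSENT apart from GPU_AVAILABLE, instead of returning TRUE for both, would let the caller skip GPU status collection on nodes without a driver. Whether a missing driver really means "no GPU" is the ambiguity noted above, so this is an assumption and not a guarantee.
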
>
> Doug
>
>
> At Wed, 15 Feb 2012 18:56:36 -0500,
> Axel Kohlmeyer wrote:
> >
> > On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson <djohnson at osc.edu> wrote:
> > > Hi David,
> > >
> > > I was going to send a separate email about '--with-nvml-include' once
> > > I had more time to look at the problem. It seems that nvml.h no
> > > longer exists in the newer versions of the CUDA SDK. We have version
> >
> > http://developer.nvidia.com/nvidia-management-library-NVML
> >
> > axel.
> >
> > > 4.1.28 of both the gpucomputingsdk and cudatoolkit, there is no nvml.h
> > > and enabling this option in torque results in failure to build. I
> > > haven't had a chance to take a look at older versions or the release
> > > notes for descriptions of when this changed.
> > >
> > > Is it safe to assume that if we were able to use this code, a context
> > > to the cards would be kept open by the mom?
> > >
> > > Doug
> > >
> > > At Wed, 15 Feb 2012 16:22:09 -0700,
> > > David Beer wrote:
> > >>
> > >> Doug,
> > >>
> > > >> Have you tried using the --with-nvml-include=<path> option in
> > > >> configure? This has pbs_mom use the nvidia API for these calls,
> > > >> and should speed things up a bit. The path should be the path to
> > > >> the nvml.h file and is usually:
> > >> /usr/local/cuda/CUDAToolsSDK/NVML/
> > >>
> > >> David
> > >>
> > > >> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson <djohnson at osc.edu> wrote:
> > >>
> > >> Hi,
> > >>
> > > >> Has anyone noticed the overhead when enabling GPU support in torque?
> > >> The nvidia-smi process requires about 4 cpu seconds for each
> > >> invocation. When executing a non-GPU code that uses all the cores
> > >> this results in a bit of oversubscription of the cores. Since
> > >> nvidia-smi is executed every 30 seconds to collect card state this
> > >> results in a measurable decrease in performance.
> > >>
> > > >> As a workaround I've enabled 'persistence mode' for the card. When
> > > >> not in use, the card is apparently not initialized. With persistence
> > > >> mode enabled the cpu time to execute the command is reduced to ~0.02.
> > > >> This will also help with the execution time of short kernels, as the
> > > >> card will be ready to go.
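
A rough worked version of the numbers quoted above, expressed as the share of one core taken by the periodic nvidia-smi call. The function name is hypothetical. The sketch assumes that the CPU time lands on a single core, that the 30-second polling interval is exact, and that "~0.02" means CPU seconds.

```c
/* Fraction of one core consumed by a probe that costs cpu_seconds_per_call
 * of CPU time and runs once every interval_seconds of wall time.
 * Assumes interval_seconds is greater than zero. */
double probe_core_fraction(double cpu_seconds_per_call, double interval_seconds)
{
    return cpu_seconds_per_call / interval_seconds;
}
```

With the figures from the message, probe_core_fraction(4.0, 30.0) is about 0.13, or roughly 13% of one core, while probe_core_fraction(0.02, 30.0) is about 0.0007, or roughly 0.07%. That is consistent with a measurable slowdown for a job that already uses every core, and with persistence mode making the overhead negligible.
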
> > >>
> > >> Do other people run with persistence mode enabled? Are there any
> > >> downsides?
> > >>
> > >> Doug
> > >>
> > >> PS. I think if X were running this would not be an issue.
> > >> _______________________________________________
> > >> torqueusers mailing list
> > >> torqueusers at supercluster.org
> > >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > >>
> > >> --
> > >> David Beer | Software Engineer
> > >> Adaptive Computing
> > >>
> > >>
> >
> >
> >
> > --
> > Dr. Axel Kohlmeyer akohlmey at gmail.com
> > http://sites.google.com/site/akohlmey/
> >
> > Institute for Computational Molecular Science
> > Temple University, Philadelphia PA, USA.
>
--
David Beer | Software Engineer
Adaptive Computing