[torqueusers] TORQUE GPU support

Dave Ulrick d-ulrick at comcast.net
Fri Sep 6 15:20:37 MDT 2013


We recently upgraded to TORQUE, so I took the opportunity to enable 
nVidia GPU support. It adds some GPU statistics to the 'pbsnodes' 
output that my users might find useful, but I'm having trouble 
figuring out how to support it in a reliable and frugal manner.

My issue is that the GPU code gets its stats by calling 
'nvidia-smi -q' every few seconds. This is reasonably efficient if the 
nVidia driver is in persistence mode. If it isn't, the nVidia driver is 
loaded when nvidia-smi initializes the GPUs, then unloaded when nvidia-smi 
exits. On our compute nodes the driver takes 0.2-0.5 seconds of wall time 
to load. This keeps an otherwise idle compute node busy enough that its 
load average can go above 0.20. Persistence mode eliminates the 
driver load/unload overhead, but we're seeing that some user applications 
can fail when they try to initialize the GPUs if the previous application 
didn't properly release them. Such applications work fine when persistence 
mode is disabled.
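
For anyone who wants to compare the per-call cost on their own nodes, 
here is a minimal sketch, assuming Python 3 is available there. The helper 
name time_command is hypothetical and is not part of TORQUE or the nVidia 
tools. It only measures how long each invocation takes, so running it once 
with persistence mode off and once with it on (for example after 
'nvidia-smi -pm 1' as root) should show roughly how much of the time is 
driver load/unload.

```python
import shutil
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Return the wall-clock duration in seconds of each run of cmd."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the output; only the elapsed wall time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Assumption: nvidia-smi is on PATH. On a machine without it, fall back
    # to a trivial command so the script still runs.
    if shutil.which("nvidia-smi"):
        cmd = ["nvidia-smi", "-q"]
    else:
        cmd = [sys.executable, "-c", "pass"]
    times = time_command(cmd)
    print(f"{' '.join(cmd)}: min {min(times):.3f}s, max {max(times):.3f}s")
```

The numbers include process start-up, so they are an upper bound on the 
driver load time rather than an exact measurement.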

At this point our options seem to be:

1. TORQUE GPU support enabled with nVidia persistence mode on: GPU stats 
and minimum driver overhead but questionable application reliability.

2. TORQUE GPU support disabled with nVidia persistence mode off: no GPU 
stats with increased driver overhead but good application reliability.

3. TORQUE GPU support enabled with nVidia persistence mode off: GPU stats 
and good app reliability but high driver overhead.
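
To put a rough number on the overhead in option 3, the fraction of time a 
node spends inside nvidia-smi can be estimated as the time per call divided 
by the polling interval. This is only a back-of-the-envelope sketch: the 
function name is hypothetical, the 2.5-second interval is an assumed value 
(the actual interval depends on the TORQUE configuration), and load average 
is related to this fraction but not identical to it.

```python
def polling_busy_fraction(call_seconds, interval_seconds):
    """Fraction of wall time spent in a call repeated every interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A call cannot keep the node busy for more than the whole interval.
    return min(call_seconds / interval_seconds, 1.0)


if __name__ == "__main__":
    # 0.2 and 0.5 are the measured driver load times from the post;
    # 2.5 seconds is an ASSUMED polling interval, used for illustration only.
    for call_seconds in (0.2, 0.5):
        fraction = polling_busy_fraction(call_seconds, 2.5)
        print(f"{call_seconds:.1f}s per call -> busy fraction {fraction:.2f}")
```

Under that assumed interval, the slower end of the measured range alone 
would account for a busy fraction of about 0.20.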

If you're using TORQUE GPU support, have you noticed the issue I'm seeing? 
If so, how have you chosen to cope? If you've enabled nVidia persistence 
mode, how are your users managing to run GPU apps that can reliably 
initialize the GPUs even if the previous GPU app failed to release them 
properly? Our GPU users are using CUDA 5.0.

Dave Ulrick
d-ulrick at comcast.net
