[torquedev] managing exclusive access to GPUs
Sam.Moskwa at csiro.au
Wed Nov 10 17:37:57 MST 2010
As there has been a bit of traffic about GPUs lately, it seems appropriate to give a summary of how we currently manage them.
There are two issues
1) the scheduler needs to be aware of how many GPUs are available per node
2) we want to restrict access to devices (and probably make this transparent to the user)
The 'solutions' we have in place are:
1) We have a cluster with 128 compute nodes each with 2 Tesla Fermi GPUs.
We are using Torque & Moab, so our moab.cfg includes GRES entries for the GPUs.
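For illustration only (the node names and counts below are assumptions, not our actual config), declaring per-node GRES capacity in moab.cfg can look along these lines:

```
# hypothetical moab.cfg fragment: each node advertises 2 GPUs' worth of GRES
NODECFG[node001]  GRES=gpu:2
NODECFG[node002]  GRES=gpu:2
```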
Users request GPUs with something like
qsub -l nodes=1:ppn=1,gres=gpu jobscript
The main limitation is that GRES are 'per core' (a better name than gpu might be gpppn, for "gpus per processor per node"), so if you need to request ppn > gpppn, the only way is to instead request the whole node. i.e. the syntax does not allow requesting 1 GPU together with 2 or more CPU cores.
In any case it lets us handle the most common case (where each MPI worker uses 1 GPU and the user just submits 2 workers per node), and we can fill the remaining slots on a node with CPU-only work.
2) Since we allow multiple jobs to share a node, we need a way to restrict which devices each job has access to. This can be done with the permissions on /dev/nvidia*.
One CUDA limitation is that a user needs access to both GPUs on a node or CUDA calls fail, even if they only try to use one of the devices.
This has apparently been fixed in CUDA/3.2, but I have not had time to see if it works on our system (and codes linked against pre-CUDA/3.2 libraries would possibly continue to fail?).
So if we have two users each requesting a single GPU, and we want to schedule them to the same node, we are forced to give both of them permission to access both GPUs.
This means we can't just change the owner of /dev/nvidia*; we need to use the group permissions.
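A minimal sketch of the group-permission idea (illustrative only; in production the targets are the /dev/nvidia* device nodes and the group is 'video'). A temp file stands in for the device node so the sketch runs without GPU hardware:

```shell
# stand-in for /dev/nvidia0; a real prologue would also chgrp it to 'video'
dev=$(mktemp)
chmod 660 "$dev"                   # rw for owner and group, nothing for others
perms=$(ls -l "$dev" | cut -c1-10) # first 10 chars are the permission bits
echo "$perms"                      # prints "-rw-rw----"
rm -f "$dev"
```

With mode 660 and group 'video', only users in that group (plus the owner, typically root) can open the device.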
So in the job prologue and epilogue we add/remove the user from the 'video' group. This required a minor hack in the Torque source so that it doesn't use a cached copy of the user's groups when spawning their processes, but instead rereads the groups after running the prologue (it also works with Torque-spawned MPI processes).

The prologue and epilogue also need to keep track of which jobs have requested a GPU, and only remove a user from the video group when the last of their GPU-requesting jobs on a node completes. We manage this by simply adding and removing entries in a gpulock file, and only removing the user from the video group when they have no remaining locks.
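The gpulock bookkeeping can be sketched as below (the file name, "user jobid" record format, and helper names are assumptions for illustration, not our actual scripts):

```shell
GPULOCK=$(mktemp)   # per-node lock file, one "user jobid" record per line

add_lock()    { echo "$1 $2" >> "$GPULOCK"; }                  # prologue
remove_lock() { grep -v "^$1 $2\$" "$GPULOCK" > "$GPULOCK.tmp"; mv "$GPULOCK.tmp" "$GPULOCK"; }
has_locks()   { grep -q "^$1 " "$GPULOCK"; }  # user still holds any lock?

add_lock alice 101.server      # prologue: first GPU job arrives
add_lock alice 102.server      # prologue: second GPU job, same user
remove_lock alice 101.server   # epilogue: first job finishes
state1=$(has_locks alice && echo keep || echo remove)
remove_lock alice 102.server   # epilogue: last job finishes
state2=$(has_locks alice && echo keep || echo remove)
echo "$state1 $state2"         # prints "keep remove"
rm -f "$GPULOCK"
```

Only when the last lock disappears (state2) does the epilogue actually take the user out of the video group.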
So all this gives us an environment where only users who request GPUs have access to the devices, and the scheduler won't oversubscribe a node, thanks to the GRES limits.
However, there is nothing that prevents a user from requesting a single GPU and then going ahead and using both. Also, there is nothing to tell a user which GPU they should be using.
The CUDA_VISIBLE_DEVICES environment variable is the final piece of the puzzle; it was added in CUDA/3.1. The prologue script can't actually set this in the job's environment directly, so one answer might be for it to record the PBS_JOBID and allocated GPU device number somewhere (the gpulock file I mentioned earlier, or a PBS_GPUNODES or whatever) and then have the site .login files extract and set it.
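A sketch of what such a .login fragment might do (the "jobid deviceno" record format and file location are assumptions; a temp file stands in for the per-node record the prologue would write):

```shell
GPUFILE=$(mktemp)                      # stands in for the per-node record
echo "12345.server 0" >> "$GPUFILE"    # prologue recorded: job 12345 got device 0
echo "12346.server 1" >> "$GPUFILE"

PBS_JOBID=12345.server                 # set by Torque inside a real job
# look up this job's allocated device and export it for the CUDA runtime
dev=$(awk -v j="$PBS_JOBID" '$1 == j { print $2 }' "$GPUFILE")
[ -n "$dev" ] && export CUDA_VISIBLE_DEVICES="$dev"
echo "$CUDA_VISIBLE_DEVICES"           # prints "0"
rm -f "$GPUFILE"
```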
It can still be circumvented by a user (unless the split permissions do work in 3.2), but it is still an improvement.