[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Sreedhar Manchu sm4082 at nyu.edu
Wed Oct 2 14:51:14 MDT 2013


I add CUDA_VISIBLE_DEVICES to users' PBS scripts using a qsub wrapper. If users need to pass device numbers to their program as input, those numbers must always start from zero no matter what the real device numbers are, because the CUDA runtime renumbers the visible devices starting at 0. So I defined another variable, CUDA_DEVICES, and asked users to use that one for this purpose.

As an example, let's say there are 8 GPUs on the host. One user requests 3 GPUs and gets 0,1,2. Another user requests 3 GPUs and gets 3,4,5.

Then CUDA_VISIBLE_DEVICES for the first user is 0,1,2
CUDA_DEVICES for the first user is 0 1 2

For the second user CUDA_VISIBLE_DEVICES is 3,4,5
CUDA_DEVICES is 0 1 2
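
The second user's case can be reproduced by hand as a rough sketch. It assumes that $PBS_GPUFILE holds one line per granted GPU in the form <host>-gpu<N>, which is what the awk field separator in the wrapper functions implies. The node name gpunode01 and the temporary file are made up for illustration only.

```shell
#!/bin/bash
# Hypothetical sample: node "gpunode01" grants GPUs 3, 4 and 5 to this job.
node=gpunode01
gpufile=$(mktemp)
printf '%s\n' "$node-gpu3" "$node-gpu4" "$node-gpu5" > "$gpufile"

# Real device numbers, comma separated, in the form CUDA_VISIBLE_DEVICES takes.
CUDA_VISIBLE_DEVICES=$(grep "$node" "$gpufile" |
    awk -F'-gpu' '{printf "%s%s", sep, $2; sep=","}')

# Renumbered indices as the application sees them, always starting at 0.
ngpus=$(grep -c "$node" "$gpufile")
CUDA_DEVICES=$(seq 0 $((ngpus - 1)) | tr '\n' ' ')
CUDA_DEVICES=${CUDA_DEVICES% }   # drop the trailing space left by tr

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "CUDA_DEVICES=$CUDA_DEVICES"
rm -f "$gpufile"
```

With this sample file the two echo lines should show 3,4,5 for CUDA_VISIBLE_DEVICES and 0 1 2 for CUDA_DEVICES.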

The only problem is that we allow users to log in to machines where their jobs are running. This means they can log in to the machine and run jobs on other GPUs as well.
Also, I have asked my users not to redefine CUDA_VISIBLE_DEVICES in their scripts.

These two functions in my qsub wrapper take care of this.

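# Emit a line that sets CUDA_DEVICES to 0..N-1, where N is the number of
# GPUs granted to the job on this host (one line per GPU in $PBS_GPUFILE).
function define_cuda_devices ()
{
        if [ "$user_shell" = 'bash' ];then
                echo 'export CUDA_DEVICES=`seq 0 $(($(grep -c $HOSTNAME $PBS_GPUFILE)-1))`'
        elif [ "$user_shell" = 'tcsh' ];then
                echo 'setenv CUDA_DEVICES `seq 0 $(($(grep -c $HOSTNAME $PBS_GPUFILE)-1))`'
        fi
}

# Emit a line that sets CUDA_VISIBLE_DEVICES to the real device numbers
# granted on this host, comma separated (the text after "-gpu" on each line).
function define_cuda_visible_devices ()
{
        if [ "$user_shell" = 'bash' ];then
                echo 'export CUDA_VISIBLE_DEVICES=`grep $HOSTNAME $PBS_GPUFILE | awk -F'\''-gpu'\'' '\''{printf A$2;A=","}'\''`'
        elif [ "$user_shell" = 'tcsh' ];then
                echo 'setenv CUDA_VISIBLE_DEVICES `grep $HOSTNAME $PBS_GPUFILE | awk -F'\''-gpu'\'' '\''{printf A$2;A=","}'\''`'
        fi
}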
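
On the user side, a job script could then loop over CUDA_DEVICES rather than the real device numbers. This is only a sketch: the program name ./my_gpu_app and its --device option are hypothetical, and CUDA_DEVICES is set by hand here to the value the wrapper would export for a 3-GPU job.

```shell
#!/bin/bash
# Value the wrapper would export for a 3-GPU job (set by hand for this sketch).
CUDA_DEVICES="0 1 2"

count=0
for dev in $CUDA_DEVICES; do
    # ./my_gpu_app and --device are hypothetical; echo stands in for the launch.
    echo "./my_gpu_app --device $dev"
    count=$((count + 1))
done
echo "launched on $count devices"
```

Because the indices always start at 0, the same script works whichever physical GPUs the job was given.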


Sreedhar.



On Oct 2, 2013, at 2:25 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> Thanks Michael.  That was the problem.  After building it via rpmbuild I was able to get the gpu_status reported.  Any idea how I can make sure our jobs are spread across the gpu's.  Is there anything I need to do during job submission or will it just spread 1 job per gpu?
> 
> Thanks,
> -J
> 
> 
> On Wed, Oct 2, 2013 at 10:45 AM, Michael Jennings <mej at lbl.gov> wrote:
> On Wednesday, 02 October 2013, at 10:38:57 (-0700),
> Jagga Soorma wrote:
> 
> > Thanks for your reply Eva.  I just tried compiling the 4.2.5 release but
> > still can't see the gpu status.  Here is what I have done:
> >
> > --
> > On server did:
> > ./configure --enable-nvidia-gpus --enable-debug
> > make
> > make install
> >
> > On client did:
> > ./configure --enable-nvidia-gpus --enable-debug
> > make
> > make rpm
> > installed the torque and torque-client rpm
> 
> This won't work.  There's currently no GPU support in the spec file.
> You'll need to do:
> 
> rpmbuild --define 'acflags --enable-nvidia-gpus' --with debug -ta torque-4.2.5.tar.gz
> 
> and then install the resulting torque and torque-client RPMs.
> 
> Michael
> 
> --
> Michael Jennings <mej at lbl.gov>
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E        W: 510-495-2687
> MS 050B-3209          F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
