[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Andrus, Brian Contractor bdandrus at nps.edu
Tue Oct 1 21:56:54 MDT 2013


That's all you will see: gpus=x

Did you reinstall pbs_server as well?


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Jagga Soorma
Sent: Tuesday, October 01, 2013 11:45 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Hi Guys,

I have a need to enable gpu support on my existing cluster and I have spun up a new test environment with the same Torque 2.5.9 version and configured it the following way:

On the server (does not have any gpus):
./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
make
make install

Updated the config files and started pbs_sched and pbs_server.

On the client (this has 3 GPUs - Tesla M2050s):
./configure -with-debug --enable-nvidia-gpus --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64 --with-nvml-include=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/include
make
make rpm

Then I installed the torque and torque-client RPMs, pointed this client at the server, and started the pbs_mom daemon.

On the server this client now shows up as connected and free for use and I can submit a simple interactive job.

However, I was expecting the pbsnodes command to give me status on the GPUs attached to my clients, but all I see is:

--
node1
     state = free
     np = 16
     ntype = cluster
     status = rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
     gpus = 3
--
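To check many nodes at once for the gpus attribute, the plain-text pbsnodes listing can be parsed with a small script. The following is only a sketch: it assumes the layout shown above (an unindented node name followed by indented "key = value" lines), and the function names parse_pbsnodes and gpu_counts are made up for illustration, not part of Torque.

```python
def parse_pbsnodes(text):
    """Parse plain-text pbsnodes output into {node_name: {attribute: value}}.

    Assumes an unindented line names a node and indented lines hold
    "key = value" pairs, as in the listing quoted in this message.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Split only on the first " = " so the status string stays whole.
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def gpu_counts(nodes):
    """Return {node_name: gpu_count}; nodes without a gpus line count as 0."""
    return {name: int(attrs.get("gpus", 0)) for name, attrs in nodes.items()}


# Shortened sample in the same shape as the output above.
SAMPLE = """\
node1
     state = free
     np = 16
     ntype = cluster
     status = rectime=1380652415,varattr=,jobs=,state=free,ncpus=16,opsys=linux
     gpus = 3
"""

if __name__ == "__main__":
    print(gpu_counts(parse_pbsnodes(SAMPLE)))
```

In practice the text would come from running pbsnodes on the server rather than from the SAMPLE string.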

Also, if I try to submit a job requesting a gpu I get the following error:

qsub -I -l nodes=1:ppn=1:gpus=2

--
PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child failed in previous commit request for job 7173.xx
--

How can I get GPU support enabled?  Am I missing something here?  Also, what I am trying to achieve is to allow torque to better spread jobs across the 3 different GPUs.  It looks like our current environment loads up the first GPU and never tries to balance the jobs across the other 2 available GPUs.
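One way a job could be kept on the GPUs the scheduler assigned to it, once GPU requests are accepted, is sketched below. It rests on assumptions that are not confirmed in this thread: that the job environment provides PBS_GPUFILE, that the file holds one "<hostname>-gpu<index>" entry per line, and that the hostname in the file matches what the node reports for itself. The helper names are hypothetical.

```python
import os
import socket


def gpu_indices(gpufile_text, hostname):
    """Return the sorted GPU indices listed for `hostname`.

    Assumes each line looks like "<hostname>-gpu<index>" (an assumption
    about the PBS_GPUFILE format, not verified against Torque 2.5.9).
    """
    indices = []
    for line in gpufile_text.splitlines():
        line = line.strip()
        if "-gpu" not in line:
            continue
        # rpartition keeps hostnames that themselves contain "-gpu" intact.
        host, _, index = line.rpartition("-gpu")
        if host == hostname and index.isdigit():
            indices.append(int(index))
    return sorted(indices)


def cuda_visible_devices(indices):
    """Format indices as a CUDA_VISIBLE_DEVICES value, e.g. "0,2"."""
    return ",".join(str(i) for i in indices)


if __name__ == "__main__":
    path = os.environ.get("PBS_GPUFILE")  # assumed to be set inside a GPU job
    if path:
        with open(path) as handle:
            text = handle.read()
        # Assumes the file uses the same name that gethostname() returns.
        print(cuda_visible_devices(gpu_indices(text, socket.gethostname())))
```

A job script could export the printed value as CUDA_VISIBLE_DEVICES before launching the application, so that the CUDA runtime only sees the assigned devices.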

Any help would be appreciated.

Thanks,
-J

