[torqueusers] Trying to get gpu support enabled with Torque 2.5.9
Andrus, Brian Contractor
bdandrus at nps.edu
Tue Oct 1 21:56:54 MDT 2013
That's all you will see: gpus=x
Did you reinstall pbs_server as well?
Naval Postgraduate School
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Jagga Soorma
Sent: Tuesday, October 01, 2013 11:45 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Trying to get gpu support enabled with Torque 2.5.9
I have a need to enable gpu support on my existing cluster and I have spun up a new test environment with the same Torque 2.5.9 version and configured it the following way:
On the server (does not have any gpus):
./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
Then I updated the config files and started pbs_sched and pbs_server.
On the client (which has 3 GPUs, Tesla M2050s):
./configure -with-debug --enable-nvidia-gpus --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64 --with-nvml-include=/v
Then I installed the torque and torque-client RPMs, pointed this client to the server, and started the pbs_mom daemon.
On the server this client now shows up as connected and free for use and I can submit a simple interactive job.
However, I was expecting the pbsnodes command to report the status of the GPUs attached to my clients, but all I see is:
state = free
np = 16
ntype = cluster
status = rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 126.96.36.199-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
gpus = 3
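The status attribute above is a comma-separated list of key=value pairs, so it can be split apart to see exactly which keys the node reports. Below is a minimal Python sketch under that assumption. The helper name parse_status is hypothetical and not part of Torque, the sample string is a shortened copy of the output above, and the sketch makes no claim about which GPU-related keys a correctly configured pbs_mom would add.

```python
def parse_status(status):
    """Split a pbsnodes status string into a dict of key -> value."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip any fragment that has no '='
            fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the status line shown above (assumed format).
sample = ("rectime=1380652415,varattr=,jobs=,state=free,gres=,loadave=0.01,"
          "ncpus=16,physmem=24730388kb,nusers=0,opsys=linux")

fields = parse_status(sample)

# Collect any key that mentions "gpu"; an empty list means the status
# string itself carries no GPU information.
gpu_keys = [k for k in fields if "gpu" in k.lower()]

print("ncpus:", fields["ncpus"])
print("GPU-related keys:", gpu_keys or "none")
```

Running the same helper on the status string from a node before and after a rebuild is a quick way to check whether the reported keys changed.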
Also, if I try to submit a job requesting a gpu I get the following error:
qsub -I -l nodes=1:ppn=1:gpus=2
PBS_Server: LOG_ERROR::Undefined attribute (15002) in send_job, child failed in previous commit request for job 7173.xx
How can I get GPU support enabled? Am I missing something here? What I am ultimately trying to achieve is to have Torque spread jobs more evenly across the 3 GPUs. In our current environment it loads up the first GPU and never tries to balance jobs across the other 2 available GPUs.
Any help would be appreciated.