[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Jagga S jagga13 at gmail.com
Tue Oct 1 22:32:28 MDT 2013


Yes, I did reinstall pbs_server as well.  What about the following error?

> qsub -I -l nodes=1:ppn=1:gpus=2
>  
> --
> PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child failed in previous commit request for job 7173.xx
> --

I can't seem to submit a job when asking for those resources. Also, how do I make sure that jobs are spread across all available GPUs instead of all going to the very first GPU?
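
For reference, here is a rough sketch of how a job script could restrict itself to the GPUs Torque assigns, once gpus=N requests are actually accepted. It rests on assumptions not confirmed anywhere in this thread: that this Torque build sets $PBS_GPUFILE for GPU jobs, that each line of that file looks like "<hostname>-gpu<index>", and that those indices match the CUDA device numbering. The helper name gpu_ids_from_file is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of Torque): convert a GPU list file into a
# comma-separated index list suitable for CUDA_VISIBLE_DEVICES.
# ASSUMPTION: each line has the form "<hostname>-gpu<index>", e.g. node1-gpu2.
gpu_ids_from_file() {
    local gpufile="$1"
    # Keep only the digits after the last "-gpu" on each line, then join
    # the resulting lines with commas.
    sed -n 's/.*-gpu\([0-9][0-9]*\)$/\1/p' "$gpufile" | paste -sd, -
}

# Inside a job script: only act if Torque actually provided a GPU file.
# ASSUMPTION: PBS_GPUFILE is set by this Torque build when gpus=N is granted.
if [ -n "${PBS_GPUFILE:-}" ] && [ -r "$PBS_GPUFILE" ]; then
    CUDA_VISIBLE_DEVICES="$(gpu_ids_from_file "$PBS_GPUFILE")"
    export CUDA_VISIBLE_DEVICES
fi
```

If those assumptions hold, a CUDA program started after this point would only see the devices listed in the file, so two jobs assigned different GPUs would not both land on the first one. If $PBS_GPUFILE is unset, the script leaves the environment alone.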

Thanks.
-J

> On Oct 1, 2013, at 8:56 PM, "Andrus, Brian Contractor" <bdandrus at nps.edu> wrote:
> 
> That’s all you will see: gpus=x
>  
> Did you reinstall pbs_server as well?
>  
>  
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>  
>  
>  
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Jagga Soorma
> Sent: Tuesday, October 01, 2013 11:45 AM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Trying to get gpu support enabled with Torque 2.5.9
>  
> Hi Guys,
>  
> I need to enable GPU support on my existing cluster, so I have spun up a new test environment with the same Torque 2.5.9 version and configured it the following way:
>  
> On the server (does not have any gpus):
> ./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
> make 
> make install
>  
> Then I updated the config files and started pbs_sched and pbs_server.
>  
> On the client (this has 3 GPUs, Tesla M2050s):
> ./configure -with-debug --enable-nvidia-gpus --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64 --with-nvml-include=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/include
> make
> make rpm
>  
> Then I installed the torque and torque-client RPMs, pointed this client at the server, and started the pbs_mom daemon.
>  
> On the server this client now shows up as connected and free for use and I can submit a simple interactive job.
>  
> However, I was expecting the pbsnodes command to give me status on the GPUs attached to my clients, but all I see is:
>  
> --
> node1
>      state = free
>      np = 16
>      ntype = cluster
>      status = rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
>      gpus = 3
> --
>  
> Also, if I try to submit a job requesting a gpu I get the following error:
>  
> qsub -I -l nodes=1:ppn=1:gpus=2
>  
> --
> PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child failed in previous commit request for job 7173.xx
> --
>  
> How can I get GPU support enabled?  Am I missing something here?  Also, what I am trying to achieve is to have Torque spread jobs more evenly across the 3 GPUs.  In our current environment it appears to load up the first GPU and never balances jobs across the other 2 available GPUs.
>  
> Any help would be appreciated.
>  
> Thanks,
> -J