[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Jagga Soorma jagga13 at gmail.com
Wed Oct 2 11:38:57 MDT 2013


Thanks for your reply, Eva.  I just tried compiling the 4.2.5 release but
still can't see the GPU status.  Here is what I have done:

--
On the server I ran:
./configure --enable-nvidia-gpus --enable-debug
make
make install

On the client I ran:
./configure --enable-nvidia-gpus --enable-debug
make
make rpm
and then installed the torque and torque-client RPMs.

I now see my node, but I don't see any GPU information besides the gpus
resource I defined myself:

server> pbsnodes -a
node01
     state = free
     np = 16
     ntype = cluster
     status = rectime=1380735451,varattr=,jobs=,state=free,netload=674812433094,gres=,loadave=0.08,ncpus=16,physmem=24730388kb,availmem=48777992kb,totmem=49904200kb,idletime=1150,nusers=0,nsessions=0,uname=Linux node01 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 3

node01> nvidia-smi
Wed Oct  2 10:38:13 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050              | 0000:06:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050              | 0000:14:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050              | 0000:11:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
--
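
For anyone who wants to run the same check across more than one node, here
is a minimal sketch rather than a tested tool. It assumes the usual
`pbsnodes -a` layout (node name in column 0, indented "key = value" lines,
blank line between nodes, no mail-wrapped lines). The function names are
hypothetical and the sample text is abbreviated from the outputs in this
thread.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None            # a blank line ends a node record
            continue
        if not line[0].isspace():     # node names start in column 0
            current = nodes.setdefault(line.strip(), {})
            continue
        if current is None:
            continue
        key, sep, value = line.strip().partition(" = ")
        if sep:                       # skip lines that are not "key = value"
            current[key] = value
    return nodes


def nodes_missing_gpu_status(nodes):
    """Return names of nodes that define gpus but report no gpu_status."""
    return sorted(
        name for name, attrs in nodes.items()
        if int(attrs.get("gpus", "0")) > 0 and "gpu_status" not in attrs
    )


# Abbreviated sample (assumption: real output has many more fields).
SAMPLE = """\
node01
     state = free
     np = 16
     ntype = cluster
     status = rectime=1380735451,state=free,ncpus=16,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 3

gpu-2-16
     state = free
     np = 32
     gpus = 4
     gpu_status = gpu[0]=gpu_id=0000:03:00.0;gpu_state=Unallocated,driver_ver=325.15
"""

if __name__ == "__main__":
    # node01 defines gpus but has no gpu_status line, so it is reported.
    print(nodes_missing_gpu_status(parse_pbsnodes(SAMPLE)))  # ['node01']
```

On a live system the text could come from the output of `pbsnodes -a`
instead of the SAMPLE string.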

Am I missing some configuration step needed to enable GPU support?

Thanks,
-J



On Wed, Oct 2, 2013 at 10:12 AM, Eva Hocks <hocks at sdsc.edu> wrote:

>
>
>
> Hi Jagga,
>
> I just set up GPU support in Torque 4.2.5. I played around with the
> compile options and found that --with-nvml-lib and --with-nvml-include
> do not work: gpu_status shows up for a second after starting pbs_mom
> and then the GPUs vanish. Torque 4.2.5 compiled with only
> --enable-nvidia-gpus, however, shows the GPU details and schedules GPU
> jobs just fine:
>
> gpu-2-16
>      state = free
>      np = 32
>      properties = batch,gtxtitan
>      ntype = cluster
>      status = rectime=1380733587,varattr=,jobs=,state=free,netload=77686187329,gres=,loadave=0.06,ncpus=32,physmem=264488144kb,availmem=265380468kb,totmem=272876736kb,idletime=1254,nusers=0,nsessions=0,uname=Linux gpu-2-16.local 2.6.32-358.18.1.el6.x86_64 #1 SMP Wed Aug 28 17:19:38 UTC 2013 x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 4
>      gpu_status = gpu[3]=gpu_id=0000:84:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:84:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=37 C,gpu[2]=gpu_id=0000:83:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:83:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=41 C,gpu[1]=gpu_id=0000:04:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:04:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=38 C,gpu[0]=gpu_id=0000:03:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:03:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=34 C,driver_ver=325.15,timestamp=Wed Oct  2 10:06:27 2013
>
> As for jobs requesting GPUs, Torque 4.2.5 sets the GPU to
> exclusive_thread if no mode is specified, thus allowing only one thread
> per GPU. The other options are exclusive_process and shared.
>
>
> I did not try Torque 2.5.9, sorry.
>
> -Eva
>
> On Tue, 1 Oct 2013, Jagga Soorma wrote:
>
> > Hi Guys,
> >
> > I need to enable GPU support on my existing cluster, and I have spun
> > up a new test environment with the same Torque 2.5.9 version and
> > configured it the following way:
> >
> > On the server (does not have any gpus):
> > ./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
> > make
> > make install
> >
> > then updated the config files and started pbs_sched & pbs_server.
> >
> > On the client (this has 3 GPUs, Tesla M2050s):
> > ./configure -with-debug --enable-nvidia-gpus \
> >   --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64 \
> >   --with-nvml-include=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/include
> > make
> > make rpm
> >
> > then installed the torque and torque-client RPMs.  I pointed this client
> > to the server and started the pbs_mom daemon.
> >
> > On the server this client now shows up as connected and free for use,
> > and I can submit a simple interactive job.
> >
> > However, I was expecting the pbsnodes command to give me status on the
> > GPUs attached to my clients, but all I see is:
> >
> > --
> > node1
> >      state = free
> >      np = 16
> >      ntype = cluster
> >      status = rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
> >      gpus = 3
> > --
> >
> > Also, if I try to submit a job requesting a GPU, I get the following
> > error:
> >
> > qsub -I -l nodes=1:ppn=1:gpus=2
> >
> > --
> > PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child
> > failed in previous commit request for job 7173.xx
> > --
> >
> > How can I get GPU support enabled?  Am I missing something here?  Also,
> > what I am trying to achieve is to allow Torque to better spread jobs
> > across the 3 different GPUs.  It looks like in our current environment
> > it loads up the first GPU and never tries to balance the jobs across
> > the other 2 available GPUs.
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > -J
> >