[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Eva Hocks hocks at sdsc.edu
Wed Oct 2 11:12:19 MDT 2013




Hi Jagga,

I just set up gpu support in torque 4.2.5. I played around with the
compile options and found that --with-nvml-lib and --with-nvml-include do
not work: the gpu_status shows up for a moment after starting pbs_mom and
then the gpus vanish. Torque 4.2.5 compiled with only
--enable-nvidia-gpus, however, shows the gpu details and schedules gpu
jobs just fine.
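
In other words, on the mom nodes the build that works for me is just the
bare GPU flag, with no NVML paths at all (a minimal sketch of my build; as
far as I can tell, without NVML pbs_mom simply falls back to parsing
nvidia-smi output, so the NVIDIA driver and nvidia-smi need to be installed
on the node):

./configure --enable-nvidia-gpus
make
make install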

gpu-2-16
     state = free
     np = 32
     properties = batch,gtxtitan
     ntype = cluster
     status =
rectime=1380733587,varattr=,jobs=,state=free,netload=77686187329,gres=,loadave=0.06,ncpus=32,physmem=264488144kb,availmem=265380468kb,totmem=272876736kb,idletime=1254,nusers=0,nsessions=0,uname=Linux gpu-2-16.local 2.6.32-358.18.1.el6.x86_64 #1 SMP Wed Aug 28 17:19:38 UTC 2013 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 4
     gpu_status = gpu[3]=gpu_id=0000:84:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:84:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=37 C,
gpu[2]=gpu_id=0000:83:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:83:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=41 C,
gpu[1]=gpu_id=0000:04:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:04:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=38 C,
gpu[0]=gpu_id=0000:03:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:03:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=34 C,
driver_ver=325.15,timestamp=Wed Oct  2 10:06:27 2013
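
For reference, the listing above is just the normal pbsnodes output for
that node once the gpus show up, e.g.:

pbsnodes gpu-2-16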

As for jobs requesting gpus, torque 4.2.5 will set the gpu to
exclusive_thread if no mode is specified, which allows only 1 thread
per gpu. Other options are exclusive_process and shared.
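
If you want one of the other modes, as far as I know you can just append it
to the gpus part of the nodes request, e.g. (the node/ppn/gpu counts here
are only an example):

qsub -I -l nodes=1:ppn=1:gpus=1:exclusive_process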


I did not try torque 2.5.9, sorry.

-Eva

On Tue, 1 Oct 2013, Jagga Soorma wrote:

> Hi Guys,
>
> I have a need to enable gpu support on my existing cluster and I have spun
> up a new test environment with the same Torque 2.5.9 version and configured
> it the following way:
>
> On the server (does not have any gpus):
> ./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
> make
> make install
>
> update the config files and started pbs_sched & pbs_server
>
> On the client (this has 3 GPU's - Tesla M2050s)
> ./configure -with-debug --enable-nvidia-gpus
> --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64
> --with-nvml-include=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/include
> make
> make rpm
>
> then installed the torque and torque-client rpm.  Pointed this client to
> the server and started the pbs_mom daemon.
>
> On the server this client now shows up as connected and free for use and I
> can submit a simple interactive job.
>
> However, I was expecting the pbsnodes command to give me status on the
> GPU's attached to my clients, but all I see is:
>
> --
> node1
>      state = free
>      np = 16
>      ntype = cluster
>      status =
> rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
>      gpus = 3
> --
>
> Also, if I try to submit a job requesting a gpu I get the following error:
>
> qsub -I -l nodes=1:ppn=1:gpus=2
>
> --
> PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child
> failed in previous commit request for job 7173.xx
> --
>
> How can I get GPU support enabled?  Am I missing something here?  Also,
> what I am trying to achieve is to allow torque to better spread jobs across
> the 3 different GPU's.  Looks like in our current environment it loads up
> the first GPU and never tries to balance the jobs across the other 2
> available GPU's.
>
> Any help would be appreciated.
>
> Thanks,
> -J
>


