[torqueusers] Trying to get gpu support enabled with Torque 2.5.9

Jeffrey R. Lang JRLang at uwyo.edu
Wed Oct 2 11:43:15 MDT 2013


I reported a bug with the same symptoms in an earlier version.  It was fixed in version 4.3.2, I think.  Sounds like it may be broken again.

Sent from my iPad

On Oct 2, 2013, at 12:39 PM, "Jagga Soorma" <jagga13 at gmail.com> wrote:

Thanks for your reply, Eva.  I just tried compiling the 4.2.5 release, but I still can't see the gpu status.  Here is what I have done:

--
On the server I did:
./configure --enable-nvidia-gpus --enable-debug
make
make install

On the client I did:
./configure --enable-nvidia-gpus --enable-debug
make
make rpm
installed the torque and torque-client RPMs

I now see my node, but I don't see any gpu details beyond the gpus resource I defined:

server> pbsnodes -a
node01
     state = free
     np = 16
     ntype = cluster
     status = rectime=1380735451,varattr=,jobs=,state=free,netload=674812433094,gres=,loadave=0.08,ncpus=16,physmem=24730388kb,availmem=48777992kb,totmem=49904200kb,idletime=1150,nusers=0,nsessions=0,uname=Linux node01 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 3

node01> nvidia-smi
Wed Oct  2 10:38:13 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2050              | 0000:06:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2050              | 0000:14:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M2050              | 0000:11:00.0     Off |                  Off |
| N/A   N/A    P1    N/A /  N/A |   0%    7MB / 3071MB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
--

Am I missing some configuration to enable gpu support?
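In case it matters: my understanding is that the per-node gpu count is declared in the server's nodes file (TORQUE_HOME/server_priv/nodes, /var/spool/torque by default), one line per compute node with its core and gpu counts.  A minimal sketch of such an entry for the node above:

  node01 np=16 gpus=3

Even with gpus = 3 showing in pbsnodes, though, the gpu_status attribute that pbs_mom should append to the node report (as in Eva's output below) never appears.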

Thanks,
-J



On Wed, Oct 2, 2013 at 10:12 AM, Eva Hocks <hocks at sdsc.edu> wrote:



Hi Jagga,

I just set up gpu support in torque 4.2.5. I played around with the
compile options and found that --with-nvml-lib and --with-nvml-include do
not work: gpu_status shows up for a second after starting pbs_mom and
then the gpus vanish. Torque 4.2.5 compiled with only --enable-nvidia-gpus,
however, shows the gpu details and schedules gpu jobs just fine:

gpu-2-16
     state = free
     np = 32
     properties = batch,gtxtitan
     ntype = cluster
     status = rectime=1380733587,varattr=,jobs=,state=free,netload=77686187329,gres=,loadave=0.06,ncpus=32,physmem=264488144kb,availmem=265380468kb,totmem=272876736kb,idletime=1254,nusers=0,nsessions=0,uname=Linux gpu-2-16.local 2.6.32-358.18.1.el6.x86_64 #1 SMP Wed Aug 28 17:19:38 UTC 2013 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 4
     gpu_status = gpu[3]=gpu_id=0000:84:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:84:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=37 C,
       gpu[2]=gpu_id=0000:83:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:83:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=41 C,
       gpu[1]=gpu_id=0000:04:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:04:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=38 C,
       gpu[0]=gpu_id=0000:03:00.0;gpu_product_name=GeForce GTX TITAN;gpu_display=N/A;gpu_pci_device_id=100510DE;gpu_pci_location_id=0000:03:00.0;gpu_fan_speed=30 %;gpu_memory_total=6143 MB;gpu_memory_used=14 MB;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=N/A;gpu_memory_utilization=N/A;gpu_ecc_mode=N/A;gpu_single_bit_ecc_errors=N/A;gpu_double_bit_ecc_errors=N/A;gpu_temperature=34 C,
       driver_ver=325.15,timestamp=Wed Oct  2 10:06:27 2013
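If you want to check whether your mom is hitting the same problem, one way to watch for it (a rough sketch; it assumes the default TORQUE_HOME of /var/spool/torque, where the mom logs are named by date, and uses your node name from above):

  # on the node: follow today's pbs_mom log and filter for gpu entries
  tail -f /var/spool/torque/mom_logs/$(date +%Y%m%d) | grep -i gpu

  # in another shell: poll the server's node report to see whether gpu_status ever shows up
  watch -n 2 'pbsnodes node01 | grep gpu_status'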

As for jobs requesting gpus: torque 4.2.5 will put the gpu in
exclusive_thread mode if no mode is specified, thus allowing only one
thread per gpu. The other options are exclusive_process and shared.
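The mode is appended to the gpus part of the resource request; for example (a sketch, with placeholder node/ppn counts):

  # default: the assigned gpu is put in exclusive_thread mode
  qsub -I -l nodes=1:ppn=1:gpus=1

  # one compute process per gpu
  qsub -I -l nodes=1:ppn=1:gpus=1:exclusive_process

  # let several jobs share the gpu
  qsub -I -l nodes=1:ppn=1:gpus=1:shared

As far as I know, the gpus assigned to a job are listed in the file named by $PBS_GPUFILE inside the job, and that allocation tracking is also what should spread jobs across your three gpus instead of loading up the first one.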


I did not try torque 2.5.9, sorry.

-Eva

On Tue, 1 Oct 2013, Jagga Soorma wrote:

> Hi Guys,
>
> I have a need to enable gpu support on my existing cluster and I have spun
> up a new test environment with the same Torque 2.5.9 version and configured
> it the following way:
>
> On the server (does not have any gpus):
> ./configure --enable-nvidia-gpus --with-debug --with-nvidia-gpus
> make
> make install
>
> updated the config files and started pbs_sched & pbs_server
>
> On the client (this has 3 GPUs, Tesla M2050s):
> ./configure -with-debug --enable-nvidia-gpus
> --with-nvml-lib=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/lib64
> --with-nvml-include=/var/tmp/Tesla_Deployment_Kit/tdk_3.304.5/nvml/include
> make
> make rpm
>
> then installed the torque and torque-client RPMs.  Pointed this client to
> the server and started the pbs_mom daemon.
>
> On the server this client now shows up as connected and free for use and I
> can submit a simple interactive job.
>
> However, I was expecting the pbsnodes command to give me status on the
> GPUs attached to my clients, but all I see is:
>
> --
> node1
>      state = free
>      np = 16
>      ntype = cluster
>      status = rectime=1380652415,varattr=,jobs=,state=free,netload=674176243914,gres=,loadave=0.01,ncpus=16,physmem=24730388kb,availmem=48833164kb,totmem=49904200kb,idletime=852,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux amber12 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64,opsys=linux
>      gpus = 3
> --
>
> Also, if I try to submit a job requesting a gpu I get the following error:
>
> qsub -I -l nodes=1:ppn=1:gpus=2
>
> --
> PBS_Server: LOG_ERROR::Undefined attribute  (15002) in send_job, child
> failed in previous commit request for job 7173.xx
> --
>
> How can I get GPU support enabled?  Am I missing something here?  Also,
> what I am trying to achieve is to allow torque to better spread jobs across
> the 3 different GPUs.  Looks like in our current environment it loads up
> the first GPU and never tries to balance the jobs across the other 2
> available GPUs.
>
> Any help would be appreciated.
>
> Thanks,
> -J
>

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
