[torqueusers] Strange behaviour of pbs_mom+pbs_server

Vlad Popa vlad at cosy.sbg.ac.at
Mon Aug 29 07:09:40 MDT 2011


Greetings !

I am evaluating the torque-3.0.3 snapshot version 201107121616 on
CentOS 6.x/RHEL 6.x i7 machines, and I have noticed a strange
behaviour of pbs_mom and pbs_server concerning the NVIDIA GPU support.

The story  so far:

On all my nodes, gpu01-07, I have GTX 295 or GTX 480 GPUs installed,
with CUDA 4.0 running.

I have compiled pbs_mom and pbs_server with "--enable-nvidia-gpus"
passed to "./configure". The nodes show up with "state = free" on the
pbs_server machine (whose hostname is simply gpu), but they have no
gpus listed in the output. Setting the parameter gpus=2 (or 1), either
in the nodes file of pbs_server or via qmgr on the command line, does
not last for long; the node is soon reset to no gpus again.

Below is the output of pbsnodes, with one node shown as an example:


[root@gpu ~]# pbsnodes
gpu01
      state = free
      np = 8
      properties = i7,i7-new,gpunode
      ntype = cluster
      status = rectime=1314628713,varattr=,jobs=,state=free,netload=285010593596,gres=,loadave=2.23,ncpus=8,physmem=16315316kb,availmem=46322024kb,totmem=49083308kb,idletime=3940,nusers=1,nsessions=1,sessions=15929,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0
      gpu_status = driver_ver=UNKNOWN,timestamp=Mon Aug 29 14:43:37 2011

....
....

On the pbs_mom side, the respective pbs_mom log file shows:
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=202404"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=49083308kb"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=46567492kb"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=16315316kb"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=8"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=284488470803"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to gpu
08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status update for server
08/29/2011 00:07:46;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus:
GPU cmd issued: nvidia-smi -a -x 2>&1
                ^^^^^^^^^^^^^^^^^^^^^

08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: sending to server "timestamp=Mon Aug 29 00:07:46 2011"


Running nvidia-smi -a -x myself shows that there definitely _are_
GPUs installed (here the GTX 295):

[root@gpu01 torque]# nvidia-smi -a -x
<?xml version="1.0" ?>
<!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
</nvsmi_log>

GPU 0:
     Product Name        : GeForce GTX 295
     PCI ID            : 5e010de
     Temperature        : 70 C
GPU 1:
     Product Name        : GeForce GTX 295
     PCI ID            : 5e010de
     Temperature        : 77 C


As the log files show, I am using kernel 2.6.32-131.6.1.el6.x86_64 on
an Intel i7 machine. I'd like to use the NVML libraries and headers
instead, but I haven't found them installed by CUDA on my system.
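
My understanding (and I may well be wrong here) is that libnvidia-ml.so
ships with the NVIDIA driver itself, while the nvml.h header comes with
the separate Tesla Deployment Kit rather than with the CUDA 4.0 toolkit.
If the snapshot's configure script already supports the NVML options, I
would expect the build to look roughly like this, the include/lib paths
being guesses for my installation:

   ./configure --enable-nvidia-gpus \
               --with-nvml-include=/usr/local/nvml/include \
               --with-nvml-lib=/usr/lib64

Can anybody confirm whether the snapshot supports these configure
options, and where the NVML headers are supposed to come from?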

Has anybody noticed the same, or does anyone have a clue how to fix
this? Any help is appreciated.


Greetings from Europe/Austria/Salzburg

Vlad Popa

University of Salzburg
HPC Computing/FB Computer Science
Jakob Harringer Str 2
5020 Salzburg
Austria
Europe




