[torqueusers] Strange behaviour of pbs_mom+pbs_server

Vlad Popa vlad at cosy.sbg.ac.at
Mon Aug 29 08:47:39 MDT 2011


Hi !

Problem solved. It was an NVIDIA driver issue. If I had used driver
version 280 from the beginning instead of 270, I would never have run
into such a puzzling headache.

I'm posting this for anyone running into the same difficulties.
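For anyone hitting the same symptom, it is worth checking which driver the kernel has actually loaded before digging into Torque itself. On Linux the NVIDIA driver exposes its version in /proc/driver/nvidia/version; a minimal sketch of extracting the major version (the sample string below is illustrative, not taken from the nodes in this thread):

```python
import re

# Illustrative sample of /proc/driver/nvidia/version for a 280-series
# driver; on a real node, read the file instead of using this string.
sample = ("NVRM version: NVIDIA UNIX x86_64 Kernel Module  280.13  "
          "Wed Jul 27 16:53:56 PDT 2011")

# Extract the major driver version (e.g. 280 vs. 270).
match = re.search(r"Kernel Module\s+(\d+)\.(\d+)", sample)
major = int(match.group(1))
print(major)  # 280
```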

Greetings
Vlad Popa

> I am evaluating the torque-3.0.3 snapshot version 201107121616 on
> CentOS 6.x/RHEL 6.x i7 machines, and I have noticed strange behaviour
> of pbs_mom and pbs_server concerning the NVIDIA GPU support.
>
> The story  so far:
>
> On all my nodes, gpu01-07, I have GTX 295 or GTX 480 GPUs installed,
> with CUDA 4.0 running.
>
> I compiled pbs_mom and pbs_server with "--enable-nvidia-gpus" passed
> to "./configure". The nodes show up with "state = free" on the
> pbs_server machine (whose hostname is just "gpu"), but they have no
> gpus listed in the output. Setting the parameter gpus=2 (or 1), either
> in the pbs_server nodes file or via qmgr on the command line, does not
> last: the node is soon reset back to no gpus.
>
> Below is the output of pbsnodes, with one node as an example:
>
>
> [root at gpu ~]# pbsnodes
> gpu01
>        state = free
>        np = 8
>        properties = i7,i7-new,gpunode
>        ntype = cluster
>        status =
> rectime=1314628713,varattr=,jobs=,state=free,netload=285010593596,gres=,loadave=2.23,ncpus=8,physmem=16315316kb,availmem=46322024kb,totmem=49083308kb,idletime=3940,nusers=1,nsessions=1,sessions=15929,uname=Linux
> gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
> x86_64,opsys=linux
>        mom_service_port = 15002
>        mom_manager_port = 15003
>        gpus = 0
>        gpu_status = driver_ver=UNKNOWN,timestamp=Mon Aug 29 14:43:37 2011
>
> ....
> ....
>
> On the pbs_mom side, the output of the respective pbs_mom log file
> shows:
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "opsys=linux"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "uname=Linux gpu01 2
> .6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "nsessions=0"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "nusers=0"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "idletime=202404"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "totmem=49083308kb"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "availmem=46567492kb
> "
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "physmem=16315316kb"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "ncpus=8"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "loadave=0.00"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "gres="
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "netload=28448847080
> 3"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "state=free"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "jobs= "
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "varattr= "
> 08/29/2011 00:07:46;0002;   pbs_mom;n/a;mom_server_update_stat;status
> update successfully sent to gpu
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status update
> for server
> 08/29/2011 00:07:46;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus:
> GPU cmd issued: nvidia-smi -a -x 2>&1
>                                  ^^^^^^^^^^^^^^^^^^^^^^
>
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: sending
> to server "timestamp=Mon
>    Aug 29 00:07:46 2011"
>
>
> Running nvidia-smi -a -x myself shows that there definitely _are_
> GPUs installed (here the GTX 295):
>
> [root at gpu01 torque]# nvidia-smi -a -x
> <?xml version="1.0" ?>
> <!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
> <nvsmi_log>
> <timestamp>Mon Aug 29 14:57:26 2011</timestamp>
> </nvsmi_log>
>
> GPU 0:
>       Product Name        : GeForce GTX 295
>       PCI ID            : 5e010de
>       Temperature        : 70 C
> GPU 1:
>       Product Name        : GeForce GTX 295
>       PCI ID            : 5e010de
>       Temperature        : 77 C
>
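One thing worth noting in the output above: the XML document closes right after the timestamp, and the per-GPU details ("GPU 0:", "GPU 1:") are printed as plain text after `</nvsmi_log>`. If pbs_mom parses the command's output as XML (an assumption about Torque's internals, not confirmed in this thread), it would see no GPU elements at all, which is consistent with "gpus = 0" and "driver_ver=UNKNOWN". A minimal sketch of what an XML parser sees:

```python
import xml.etree.ElementTree as ET

# The XML portion of the nvidia-smi output quoted above: it ends right
# after the timestamp, before any per-GPU information appears.
sample = """<?xml version="1.0" ?>
<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
</nvsmi_log>"""

root = ET.fromstring(sample)
# The "GPU 0:" / "GPU 1:" lines printed after </nvsmi_log> are outside
# the document, so an XML parser never sees them.
gpus = root.findall(".//gpu")
print(root.tag, len(gpus))  # nvsmi_log 0
```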
>
> As the log files show, I am using kernel 2.6.32-131.6.1.el6.x86_64 on
> an Intel i7 machine. I'd like to use the NVML libraries and headers
> instead, but I haven't found them installed by CUDA on my system.
>
> Has anybody noticed the same behaviour, or does anyone have a clue how
> to fix this? Any help is appreciated.
>
>
> Greetings from Europe/Austria/Salzburg
>
> Vlad Popa
>
> University of Salzburg
> HPC Computing/FB Computer Science
> Jakob Harringer Str 2
> 5020 Salzburg
> Austria
> Europe
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
