[torqueusers] Strange behaviour of pbs_mom+pbs_server
Vlad Popa
vlad at cosy.sbg.ac.at
Mon Aug 29 08:47:39 MDT 2011
Hi !
Problem solved.
It was an nvidia driver issue. If I had used version 280 from the
beginning instead of 270, I would never have run into such a
headache-inducing puzzle.
I'm posting this for anyone running into the same difficulties.
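
For anyone wanting to check a node quickly: the pbs_mom log quoted below
shows that pbs_mom issues "nvidia-smi -a -x", and with driver 270 the XML
part of that output contained only a timestamp. The following is a rough
sketch, not anything taken from the torque sources, that counts GPU entries
in the XML part of such output. The element name "gpu" and the second
sample string are assumptions made up for illustration; I have not checked
the exact XML that driver 280 emits.

```python
import re
import xml.etree.ElementTree as ET


def count_gpu_entries(nvsmi_output, gpu_tag="gpu"):
    """Count <gpu_tag> elements in the XML part of nvidia-smi output.

    Returns 0 when no XML log element is found. The tag name "gpu" is an
    assumption, not something confirmed by this thread.
    """
    # Keep only the root element: the output posted in this thread has
    # plain text ("GPU 0: ...") after the closing tag, which is not XML.
    match = re.search(r"<(nvsmi_log|nvidia_smi_log)\b.*?</\1>",
                      nvsmi_output, re.DOTALL)
    if match is None:
        return 0
    root = ET.fromstring(match.group(0))
    return sum(1 for _ in root.iter(gpu_tag))


# Output as posted in this thread (driver 270): a timestamp, no GPU entries.
posted = """<?xml version="1.0" ?>
<!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
</nvsmi_log>

GPU 0:
Product Name : GeForce GTX 295
"""

# Hypothetical output with per-GPU elements, for illustration only.
hypothetical = """<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
<gpu id="0"><prod_name>GeForce GTX 295</prod_name></gpu>
<gpu id="1"><prod_name>GeForce GTX 295</prod_name></gpu>
</nvsmi_log>
"""

print(count_gpu_entries(posted))        # 0
print(count_gpu_entries(hypothetical))  # 2
```

If the count is 0 although the cards are physically installed, comparing
the driver version against a newer one is probably a reasonable first step.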
Greetings
Vlad Popa
> I am evaluating the torque-3.0.3 snapshot version 201107121616 on
> CentOS 6.x / RHEL 6.x i7 machines and I have noticed strange
> behaviour of pbs_mom and pbs_server. It concerns the nvidia gpu
> support.
>
> The story so far:
>
> On all my nodes gpu01-07 I have GTX 295 or GTX 480 GPUs installed,
> with CUDA 4.0 running.
>
> I have compiled pbs_mom and pbs_server using "--enable-nvidia-gpus"
> at "./configure". The nodes show up with "state free" on the
> pbs_server machine (its hostname is simply gpu), but they have no
> gpus listed in the output. Setting the parameter gpus=2 or gpus=1,
> either in the nodes file of pbs_server or via the CLI in qmgr, does
> not persist: after a short while the node is reset to no gpus.
>
> Below is the output of pbsnodes, with one node as an example:
>
>
> [root at gpu ~]# pbsnodes
> gpu01
> state = free
> np = 8
> properties = i7,i7-new,gpunode
> ntype = cluster
> status = rectime=1314628713,varattr=,jobs=,state=free,netload=285010593596,gres=,loadave=2.23,ncpus=8,physmem=16315316kb,availmem=46322024kb,totmem=49083308kb,idletime=3940,nusers=1,nsessions=1,sessions=15929,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
> gpu_status = driver_ver=UNKNOWN,timestamp=Mon Aug 29 14:43:37 2011
>
> ....
> ....
>
> On the pbs_mom side, the corresponding pbs_mom log file shows:
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "opsys=linux"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "nsessions=0"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "nusers=0"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "idletime=202404"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "totmem=49083308kb"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "availmem=46567492kb"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "physmem=16315316kb"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "ncpus=8"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "loadave=0.00"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "gres="
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "netload=284488470803"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "state=free"
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "jobs= "
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to
> server "varattr= "
> 08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;status
> update successfully sent to gpu
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status update
> for server
> 08/29/2011 00:07:46;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus:
> GPU cmd issued: nvidia-smi -a -x 2>&1
> ^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 08/29/2011 00:07:46;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: sending
> to server "timestamp=Mon Aug 29 00:07:46 2011"
>
>
> Running nvidia-smi -a -x myself shows that there definitely _are_
> gpus installed (here the GTX 295):
>
> [root at gpu01 torque]# nvidia-smi -a -x
> <?xml version="1.0" ?>
> <!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
> <nvsmi_log>
> <timestamp>Mon Aug 29 14:57:26 2011</timestamp>
> </nvsmi_log>
>
> GPU 0:
> Product Name : GeForce GTX 295
> PCI ID : 5e010de
> Temperature : 70 C
> GPU 1:
> Product Name : GeForce GTX 295
> PCI ID : 5e010de
> Temperature : 77 C
>
>
> As the logfiles show, I am using kernel 2.6.32-131.6.1.el6.x86_64 on an
> Intel i7 machine. I'd like to use the nvml libraries and headers
> instead, but I haven't found them installed by CUDA on my system.
>
> Has anybody noticed the same, or does anyone have a clue how to fix
> this? Any help is appreciated.
>
>
> Greetings from Europe/Austria/Salzburg
>
> Vlad Popa
>
> University of Salzburg
> HPC Computing/FB Computer Science
> Jakob Harringer Str 2
> 5020 Salzburg
> Austria
> Europe
>
>
>