[torqueusers] Strange behaviour of pbs_mom+pbs_server
Vlad Popa
vlad at cosy.sbg.ac.at
Mon Aug 29 07:09:40 MDT 2011
Greetings!
I am evaluating the torque-3.0.3 snapshot version 201107121616 on
CentOS 6.x / RHEL 6.x i7 machines and I have noticed strange
behaviour in pbs_mom and pbs_server. It concerns the NVIDIA GPU
support.
The story so far:
On all my nodes gpu01-07 I have GTX 295 or GTX 480 GPUs installed,
with CUDA 4.0 running.
I compiled pbs_mom and pbs_server with "--enable-nvidia-gpus"
passed to "./configure". The nodes show up with "state = free" on the
pbs_server machine (its hostname is simply gpu), but they have no
GPUs listed in the output. Setting gpus=2 or gpus=1, either
in the nodes file of pbs_server or via qmgr on the command line, does
not persist: after a short while the node is reset to no GPUs.
Below is the pbsnodes output, with one node as an example:
[root@gpu ~]# pbsnodes
gpu01
state = free
np = 8
properties = i7,i7-new,gpunode
ntype = cluster
status = rectime=1314628713,varattr=,jobs=,state=free,netload=285010593596,gres=,loadave=2.23,ncpus=8,physmem=16315316kb,availmem=46322024kb,totmem=49083308kb,idletime=3940,nusers=1,nsessions=1,sessions=15929,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
gpu_status = driver_ver=UNKNOWN,timestamp=Mon Aug 29 14:43:37 2011
....
....
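For checking all nodes at once rather than reading the listing by eye, a minimal Python sketch along these lines could flag every node that reports no GPUs. It assumes the usual pbsnodes layout, where the node name is unindented and the attribute lines are indented (the archive copy above has lost that indentation). The function names and the abridged SAMPLE text are illustrative only; in practice the text would come from running pbsnodes.

```python
# Sketch: find nodes whose pbsnodes record reports no GPUs.
# SAMPLE is abridged from the listing above, with the indentation that
# pbsnodes normally prints restored (an assumption about the layout).
SAMPLE = """\
gpu01
     state = free
     np = 8
     properties = i7,i7-new,gpunode
     ntype = cluster
     gpus = 0
     gpu_status = driver_ver=UNKNOWN,timestamp=Mon Aug 29 14:43:37 2011
"""


def parse_pbsnodes(text):
    """Return {node_name: {attribute: value}} from pbsnodes text output."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            # Split only on the first " = " so values may contain "=".
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def nodes_without_gpus(nodes):
    """Names of nodes whose 'gpus' attribute is missing or zero."""
    return [name for name, attrs in nodes.items()
            if int(attrs.get("gpus", "0")) == 0]


nodes = parse_pbsnodes(SAMPLE)
print(nodes_without_gpus(nodes))
```

Run periodically, a check like this would also show how long a manually set gpus value survives before the node is reset.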
On the pbs_mom side, the respective pbs_mom log file shows:
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "opsys=linux"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nsessions=0"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "nusers=0"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "idletime=202404"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "totmem=49083308kb"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "availmem=46567492kb"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "physmem=16315316kb"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=8"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "loadave=0.00"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "gres="
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "netload=284488470803"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs= "
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to gpu
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status update for server
08/29/2011 00:07:46;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus:
GPU cmd issued: nvidia-smi -a -x 2>&1
^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: sending to server "timestamp=Mon Aug 29 00:07:46 2011"
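To compare what the regular status update and the GPU status update each send, the "sending to server" entries can be grouped by the function that logged them. The following Python sketch does that on three entries taken from the log above. It assumes each entry sits on a single line, as in the log file on disk (the mail wraps them); the regular expression and the function name sent_items are illustrative, not part of TORQUE.

```python
# Sketch: group the key=value items pbs_mom reports sending to the server
# by the logging function (mom_server_update_stat vs. ..._gpustat).
import re

# Three entries taken from the log above, one entry per line.
LOG = '''\
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "ncpus=8"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "state=free"
08/29/2011 00:07:46;0002; pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat: sending to server "timestamp=Mon Aug 29 00:07:46 2011"
'''

# Matches ';<function>: sending to server "<item>"' inside a log entry.
SENT = re.compile(r';(mom_server_update_\w+): sending to server "([^"]*)"')


def sent_items(log_text):
    """Return {function_name: {key: value}} for all 'sending to server' entries."""
    by_func = {}
    for line in log_text.splitlines():
        match = SENT.search(line)
        if match is None:
            continue  # not a 'sending to server' entry
        func, item = match.groups()
        key, _, value = item.partition("=")
        by_func.setdefault(func, {})[key.strip()] = value.strip()
    return by_func


for func, items in sent_items(LOG).items():
    print(func, sorted(items))
```

On the log shown here, the GPU status update carries only a timestamp entry and no per-GPU data, which matches the empty gpu_status seen on the server side.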
Running nvidia-smi -a -x by hand shows that there definitely _are_
GPUs installed (here the GTX 295):
[root@gpu01 torque]# nvidia-smi -a -x
<?xml version="1.0" ?>
<!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
</nvsmi_log>
GPU 0:
Product Name : GeForce GTX 295
PCI ID : 5e010de
Temperature : 70 C
GPU 1:
Product Name : GeForce GTX 295
PCI ID : 5e010de
Temperature : 77 C
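One detail in this output may be relevant: the XML document itself contains only a timestamp, and the per-GPU details follow as plain text after the closing tag. If pbs_mom reads only the XML part (an assumption, not checked against the TORQUE source), it would find no GPUs there. The Python sketch below counts GPUs in both parts of the output shown above. The element name "gpu" is an assumption based on the XML that later nvidia-smi versions emit, and the function names are illustrative.

```python
# Sketch: count GPUs found inside the XML document versus GPUs listed
# as plain text after it, for the nvidia-smi output shown above.
import re
import xml.etree.ElementTree as ET

OUTPUT = '''<?xml version="1.0" ?>
<!DOCTYPE nvsmi_log SYSTEM "./nvsmi.dtd">
<nvsmi_log>
<timestamp>Mon Aug 29 14:57:26 2011</timestamp>
</nvsmi_log>
GPU 0:
Product Name : GeForce GTX 295
PCI ID : 5e010de
Temperature : 70 C
GPU 1:
Product Name : GeForce GTX 295
PCI ID : 5e010de
Temperature : 77 C
'''


def split_smi_output(text, root_tag="nvsmi_log"):
    """Split the output into (xml_document, trailing_text)."""
    closing = "</%s>" % root_tag
    end = text.find(closing)
    if end == -1:
        return None, text  # no XML document found at all
    end += len(closing)
    return text[:end], text[end:]


def count_gpus(text):
    """Return (gpus_in_xml, gpus_in_plain_text)."""
    xml_part, rest = split_smi_output(text)
    xml_gpus = 0
    if xml_part is not None:
        root = ET.fromstring(xml_part)
        # "gpu" as the element name is an assumption (see lead-in).
        xml_gpus = len(list(root.iter("gpu")))
    # Plain-text section headers look like "GPU 0:", "GPU 1:", ...
    text_gpus = len(re.findall(r"^\s*GPU \d+:", rest, flags=re.MULTILINE))
    return xml_gpus, text_gpus


print(count_gpus(OUTPUT))  # (0, 2)
```

For the output above this gives zero GPUs in the XML and two in the plain text, so the driver sees the cards but does not describe them in the XML form.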
As the log files show, I am using kernel 2.6.32-131.6.1.el6.x86_64 on an
Intel i7 machine. I'd like to use the NVML libraries and headers
instead, but I haven't found them installed by CUDA on my system.
Has anybody noticed the same, or does anyone have a clue how to fix
this? Any help is appreciated.
Greetings from Europe/Austria/Salzburg
Vlad Popa
University of Salzburg
HPC Computing/FB Computer Science
Jakob Harringer Str 2
5020 Salzburg
Austria
Europe