[torqueusers] NVIDIA GPUs version error

Ken Nielson knielson at adaptivecomputing.com
Wed Aug 24 06:29:52 MDT 2011



----- Original Message -----
> From: "Steve Crusan" <scrusan at ur.rochester.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Monday, August 22, 2011 2:06:00 PM
> Subject: [torqueusers] NVIDIA GPUs version error
> 
> Hi all,
> 
> I'm getting errors in my syslog from the pbs_mom daemons on our GPU nodes:
> 
> Aug 22 15:55:09 blugpu07 pbs_mom: LOG_ERROR::a system error occured
> (15205) in generate_server_gpustatus_smi, Unknown Nvidia driver
> version
> 
> Here is the snipped output of pbsnodes blugpu07:
> <SNIPPED>
> gpu_status =
> gpu[1]=gpu_id=0:15:0;,gpu[0]=gpu_id=0:14:0;,driver_ver=275.09.07,timestamp=Mon
> Aug 22 15:56:41 2011
> 
> 
> If I log in to the node and check the pbs_mom log files, I see the
> following:
> 
> 08/22/2011 15:57:24;0002;
> pbs_mom;n/a;mom_server_all_update_gpustat;composing gpu status update
> for server
> 08/22/2011 15:57:24;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::gpus, gpus:
> GPU cmd issued: nvidia-smi -a -x 2>&1
> 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system
> error occured (15205) in generate_server_gpustatus_smi, Unknown Nvidia
> driver version
> 08/22/2011 15:57:26;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::a system
> error occured (15205) in generate_server_gpustatus_smi, Unknown Nvidia
> driver version
> 08/22/2011 15:57:26;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat:
> sending to server "timestamp=Mon Aug 22 15:57:26 2011"
> 08/22/2011 15:57:26;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat:
> sending to server "driver_ver=275.09.07"
> 08/22/2011 15:57:26;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat:
> sending to server "gpuid=0:14:0"
> 08/22/2011 15:57:26;0002;
> pbs_mom;n/a;mom_server_update_gpustat;mom_server_update_gpustat:
> sending to server "gpuid=0:15:0"
> 08/22/2011 15:57:26;0002; pbs_mom;n/a;mom_server_update_gpustat;status
> update successfully sent to bhsn-int
> 
> 
> Is our driver version not supported by TORQUE?
> 
> 
> 
> Environment:
> - TORQUE-2.5.6
> - NVIDIA Driver Version : 275.09.07
> - kernel: 2.6.18-238.12.1.el5
> 
> - TORQUE client was built via:
> This build was configured with: '--prefix=/opt/torque/2.5.6'
> '--exec-prefix=/opt/torque/2.5.6/x86_64'
> '--with-server-home=/var/spool/pbs' '--enable-syslog' '--with-scp'
> '--disable-rpp' '--disable-spool' '--with-pam' '--with-cpusets'
> '--with-geometry-requests' '--disable-gui' '--enable-nvidia-gpus'
> '--enable-docs'
> 
> 
> 
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/

Steve,

I am sorry to be so slow to respond. I will have Al Taufer respond to you when he gets back from vacation. He developed the NVIDIA GPU support and will be able to explain this error.

Regards

Ken Nielson
Adaptive Computing

