<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://www.clusterresources.com/bugzilla/bugzilla.dtd">

<bugzilla version="3.4.5"
          urlbase="http://www.clusterresources.com/bugzilla/"
          
          maintainer="support@clusterresources.com"
>

    <bug>
          <bug_id>173</bug_id>
          
          <creation_ts>2012-01-31 20:24:00 -0700</creation_ts>
          <short_desc>[torque-3.0.4] pbs_mom buffer overflow / segfaults when using --enable-nvidia-gpus [with BUG FIX]</short_desc>
          <delta_ts>2012-04-30 09:55:06 -0600</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>TORQUE</product>
          <component>pbs_mom</component>
          <version>3.0.x</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>NEW</bug_status>
          
          
          
          
          
          
          <priority>P5</priority>
          <bug_severity>critical</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Nicolas Pinto">nicolas.pinto</reporter>
          <assigned_to name="Ken Nielson">knielson</assigned_to>
          <cc>dbeer</cc>
          <cc>rea+maui</cc>
          <cc>torquedev</cc>

          <long_desc isprivate="0">
            <who name="Nicolas Pinto">nicolas.pinto</who>
            <bug_when>2012-01-31 20:24:34 -0700</bug_when>
            <thetext>pbs_mom overflows a buffer and segfaults when built with --enable-nvidia-gpus and run on a node with a large number of GPUs (e.g. 8):

Report:
-------
# with $loglevel 7
$ tail /var/log/messages
Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::check_nvidia_version_file, Nvidia driver info: NVRM version: NVIDIA UNIX x86_64 Kernel Module 290.10 Wed Nov 16 17:39:29 PST 2011
Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::gpus, gpus: GPU cmd issued: nvidia-smi -q -x 2&gt;&amp;1
Jan 31 22:04:59 munctional6 kernel: [ 4186.963718] pbs_mom[7497] general protection ip:41dd69 sp:7fff2ccf72b8 error:0 in pbs_mom[400000+56000]

Cause:
------
src/resmom/mom_server.c:2507 declares gpu_string as a fixed 16 KB array, whereas the output of &quot;nvidia-smi -q -x 2&gt;&amp;1&quot; is ~24 KB, so the buffer is overrun.

Bug fix:
--------
The following simple patch fixes the issue:

--- src/resmom/mom_server.c.orig        2012-01-12 17:18:49.000000000 -0500
+++ src/resmom/mom_server.c     2012-01-31 22:24:01.179534519 -0500
@@ -2504,7 +2504,7 @@
   static char id[] = &quot;generate_server_gpustatus_smi&quot;;
 
   char *dataptr, *outptr, *tmpptr1, *tmpptr2, *savptr;
-  char gpu_string[16 * 1024];
+  char gpu_string[32 * 1024];
   int  gpu_modes[32];
   int  have_modes = FALSE;
   int  gpuid = -1;
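
Note that enlarging the array only moves the limit. A minimal sketch of a sturdier approach, which reads the command output into a heap buffer that grows as needed, is given here. read_cmd_output() is a hypothetical helper made up for illustration, not an existing TORQUE function, and popen() is assumed to be available (POSIX):

<![CDATA[
```c
#define _POSIX_C_SOURCE 200809L  /* for popen()/pclose() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (NOT TORQUE code): run cmd and return its whole
 * output as a NUL-terminated heap string, or NULL on failure.
 * The length is stored in *len_out if len_out is not NULL.
 * The caller must free() the result. */
char *read_cmd_output(const char *cmd, size_t *len_out)
  {
  size_t  cap = 4096;  /* initial capacity, doubled when full */
  size_t  len = 0;
  size_t  n;
  char   *buf;
  char   *tmp;
  FILE   *fp = popen(cmd, "r");

  if (fp == NULL)
    return(NULL);

  buf = malloc(cap);

  if (buf == NULL)
    {
    pclose(fp);
    return(NULL);
    }

  /* always keep one byte free for the terminating NUL */
  while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0)
    {
    len += n;

    if (cap - len - 1 == 0)
      {
      /* buffer is full: double it instead of truncating or overflowing */
      tmp = realloc(buf, cap * 2);

      if (tmp == NULL)
        {
        free(buf);
        pclose(fp);
        return(NULL);
        }

      buf = tmp;
      cap *= 2;
      }
    }

  pclose(fp);

  buf[len] = '\0';

  if (len_out != NULL)
    *len_out = len;

  return(buf);
  }
```
]]>

With something like this, the caller could run &quot;nvidia-smi -q -x 2&gt;&amp;1&quot; and parse the returned string without any fixed size limit, freeing it afterwards.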

HTH

Nicolas</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Eygene Ryabinkin">rea+maui</who>
            <bug_when>2012-01-31 20:59:30 -0700</bug_when>
            <thetext>Not that easy: gpus() itself must be fixed to respect the buffer_size it is passed.  Better still, it should allocate and reallocate memory itself and return the dynamic buffer to the caller, which avoids problems with incompletely captured output from the NVIDIA SMI tools.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Nicolas Pinto">nicolas.pinto</who>
            <bug_when>2012-01-31 21:35:50 -0700</bug_when>
            <thetext>Agreed. What you suggest would be the correct way to handle this. I just hacked something in a way that is &quot;compatible&quot; with the current implementation.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Nicolas Pinto">nicolas.pinto</who>
            <bug_when>2012-04-27 21:25:11 -0600</bug_when>
            <thetext>Any update on this front? Is 4.0.0 vulnerable to this bug?</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="David Beer">dbeer</who>
            <bug_when>2012-04-30 09:55:06 -0600</bug_when>
            <thetext>This hasn&apos;t been rewritten to use the dynamic string struct yet, but that should happen soon.</thetext>
          </long_desc>
      
      

    </bug>

</bugzilla>