[torquedev] [Bug 173] New: [torque-3.0.4] pbs_mom buffer overflow / segfaults when using --enable-nvidia-gpus [with BUG FIX]
bugzilla-daemon at supercluster.org
Tue Jan 31 20:24:35 MST 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=173
Summary: [torque-3.0.4] pbs_mom buffer overflow / segfaults
when using --enable-nvidia-gpus [with BUG FIX]
Product: TORQUE
Version: 3.0.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: pbs_mom
AssignedTo: knielson at adaptivecomputing.com
ReportedBy: nicolas.pinto at gmail.com
CC: torquedev at supercluster.org
Estimated Hours: 0.0
pbs_mom overflows a fixed-size buffer and segfaults when built with
--enable-nvidia-gpus on a node with a large number of GPUs (e.g. 8):
Report:
-------
# with $loglevel 7
$ tail /var/log/messages
Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::check_nvidia_version_file, Nvidia driver info: NVRM version: NVIDIA UNIX x86_64 Kernel Module 290.10 Wed Nov 16 17:39:29 PST 2011
Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::gpus, gpus: GPU cmd issued: nvidia-smi -q -x 2>&1
Jan 31 22:04:59 munctional6 kernel: [ 4186.963718] pbs_mom[7497] general protection ip:41dd69 sp:7fff2ccf72b8 error:0 in pbs_mom[400000+56000]
Cause:
------
src/resmom/mom_server.c:2507 allocates only 16 KB for gpu_string, whereas the
output of "nvidia-smi -q -x 2>&1" is ~24 KB.
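
To illustrate the failure mode, here is a minimal sketch of reading command output into a fixed-size buffer without writing past its end. It assumes the output arrives on a stdio stream; read_bounded is a hypothetical helper for illustration, not a function in TORQUE, and the actual copy loop in mom_server.c may look different.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of TORQUE): read from fp into buf, which
 * holds cap bytes. The result is always NUL-terminated. Returns the number
 * of bytes stored and sets *truncated to 1 if more input was available
 * than would fit. */
static size_t read_bounded(FILE *fp, char *buf, size_t cap, int *truncated)
{
    size_t used = 0;
    size_t n;

    assert(buf != NULL && truncated != NULL && cap > 0);
    *truncated = 0;

    /* Never request more than the space left, minus one for the NUL. */
    while (used < cap - 1 &&
           (n = fread(buf + used, 1, cap - 1 - used, fp)) > 0)
        used += n;
    buf[used] = '\0';

    /* Buffer full: probe for one more byte to detect truncation. */
    if (used == cap - 1 && fgetc(fp) != EOF)
        *truncated = 1;

    return used;
}
```

With a 16 KB buffer and ~24 KB of input, a helper like this would store 16383 bytes and report truncation instead of overrunning the stack.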
Bug fix:
--------
The following simple patch fixes the issue:
--- src/resmom/mom_server.c.orig 2012-01-12 17:18:49.000000000 -0500
+++ src/resmom/mom_server.c 2012-01-31 22:24:01.179534519 -0500
@@ -2504,7 +2504,7 @@
   static char id[] = "generate_server_gpustatus_smi";
   char *dataptr, *outptr, *tmpptr1, *tmpptr2, *savptr;
-  char gpu_string[16 * 1024];
+  char gpu_string[32 * 1024];
   int gpu_modes[32];
   int have_modes = FALSE;
   int gpuid = -1;
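
Since the size of the nvidia-smi output grows with the number of GPUs, any fixed size can be exceeded eventually. As a possible alternative to a larger stack array, the sketch below reads a stream into a heap buffer that doubles when full. read_all is a hypothetical helper and the 4096-byte starting size is an arbitrary assumption; none of this is taken from the TORQUE source.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of TORQUE): read all of fp into a
 * NUL-terminated heap buffer that doubles in size when it fills up.
 * Returns NULL on allocation failure; the caller frees the result. */
static char *read_all(FILE *fp, size_t *len_out)
{
    size_t cap = 4096;   /* arbitrary starting capacity */
    size_t used = 0;
    size_t n;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    /* cap - 1 - used is always > 0 here, leaving room for the NUL. */
    while ((n = fread(buf + used, 1, cap - 1 - used, fp)) > 0) {
        used += n;
        if (used == cap - 1) {
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    buf[used] = '\0';

    if (len_out != NULL)
        *len_out = used;
    return buf;
}
```

The trade-off is an extra allocation per status poll in exchange for not having to guess an upper bound on the output size.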
HTH
Nicolas