Bugzilla – Bug 173
[torque-3.0.4] pbs_mom buffer overflow / segfaults when using --enable-nvidia-gpus [with BUG FIX]
Last modified: 2012-04-30 09:55:06 MDT
You need to log in before you can comment on or make changes to this bug.
There is a buffer overflow in pbs_mom when using --enable-nvidia-gpus and a large number of GPUs (e.g. 8): Report: ------- # with $loglevel 7 $ tail /var/log/messages Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::check_nvidia_version_file, Nvidia driver info: NVRM version: NVIDIA UNIX x86_64 Kernel Module 290.10 Wed Nov 16 17:39:29 PST 2011 Jan 31 22:04:56 munctional6 pbs_mom: LOG_DEBUG::gpus, gpus: GPU cmd issued: nvidia-smi -q -x 2>&1 Jan 31 22:04:59 munctional6 kernel: [ 4186.963718] pbs_mom[7497] general protection ip:41dd69 sp:7fff2ccf72b8 error:0 in pbs_mom[400000+56000] Cause: ------ src/resmom/mom_server.c:2507 only allocates 16 KB whereas the output of "nvidia-smi -q -x 2>&1" is ~24KB Bug fix: -------- The following simple patch fixes the issue: --- src/resmom/mom_server.c.orig 2012-01-12 17:18:49.000000000 -0500 +++ src/resmom/mom_server.c 2012-01-31 22:24:01.179534519 -0500 @@ -2504,7 +2504,7 @@ static char id[] = "generate_server_gpustatus_smi"; char *dataptr, *outptr, *tmpptr1, *tmpptr2, *savptr; - char gpu_string[16 * 1024]; + char gpu_string[32 * 1024]; int gpu_modes[32]; int have_modes = FALSE; int gpuid = -1; HTH Nicolas
Not that easy: in reality, gpus() must be fixed to respect the passed buffer_size. Or, better, it should do memory allocation/reallocation by itself and return the dynamic buffer to the caller to avoid problems with incompletely captured output from NVidia SMI tools.
Agreed. What you suggest would be the correct way to handle this. I just hacked something in a way that is "compatible" with the current implementation.
Any update on this front? Is 4.0.0 vulnerable to this bug?
This hasn't been re-written to use the dynamic string struct yet, but it should happen soon.