[torqueusers] Torque MOM 4.2.6 - found bug (MOM won't start with cpusets configured) + solution.

Johny johny2015 at wp.pl
Tue Nov 26 05:34:55 MST 2013


On 2013-11-26 13:26, Johny wrote:
> Hello.
>
> I've found a bug in the newest release of Torque, 4.2.6.
>
> When compiled with options:
> ./configure --with-default-server=### --with-rcp=/usr/bin/scp 
> --enable-cpuset --enable-nvidia-gpus --enable-blcr 
> --enable-geometry-requests --enable-unixsockets=no
>
> ...it just won't start (MOM client); it hangs while reading the files 
> /sys/devices/system/nodes/.../cpulist.
>
> The problem is in the function:
> /src/resmom/linux/numa_node.cpp -> void numa_node::parse_cpu_string()
> It contains a loop that parses successive parts of a string (a line of 
> text read from the files mentioned above):
>
> LINE 121:
>   while (*ptr != '\0')
>     {
>     prev = strtol(ptr, &ptr, 10);
>
>     if (*ptr == '-')
>       {
>       ptr++;
>       curr = strtol(ptr, &ptr, 10);
>
>       while (prev <= curr)
>         {
> #ifdef PENABLE_LINUX26_CPUSETS
>         if ((MOMConfigUseSMT == 1) ||
>             (is_physical_core(prev) == true))
> #endif
>           {
>           this->cpu_indices.push_back(prev);
>           this->cpu_avail.push_back(true);
>           this->total_cpus++;
>           this->available_cpus++;
>           }
>
>         prev++;
>         }
>
>       if (*ptr == ',')
>         ptr++;
>       }
>     else if ((*ptr == ',') ||
>              (*ptr == '\0'))
>       {
> #ifdef PENABLE_LINUX26_CPUSETS
>       if ((MOMConfigUseSMT == 1) ||
>           (is_physical_core(prev) == true))
> #endif
>         {
>         this->cpu_indices.push_back(prev);
>         this->cpu_avail.push_back(true);
>         this->total_cpus++;
>         this->available_cpus++;
>         }
>
>       ptr++;
>       }
>     }
>
> This loop steps over the '\0' that terminates the string: it enters the 
> second branch (the "else if", because *ptr == '\0') and then increments 
> the pointer, which moves it past the end of the string.
> In subsequent iterations it does nothing useful (there is usually some 
> garbage data after the '\0' that "strtol" cannot parse, so the pointer 
> stays where it is) and the loop never terminates.
>
> To resolve this problem I've just added one line.
>
> Previous version:
> 161: }
> 162:
> 163:ptr++;
>
> I've changed it to:
> 161: }
> 162:
> 163: if(*ptr == '\n') break;
> 164: ptr++;
>
> Now, when it enters the second branch (when *ptr == '\0'), it saves the 
> data about the core and exits the loop.
> Maybe there is a more elegant way to do this, but this is simple and it 
> works (tested).
>
> Regards,
> Peter.
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

My mistake. Of course it should be:
I've changed it to:
161: }
162:
163: if(*ptr == '\0') break;
164: ptr++;

Sorry, too much work today :(
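
For anyone who wants to try the parsing logic outside of MOM, here is a minimal standalone sketch. It is not the Torque code: parse_cpu_list() is a hypothetical helper name, the SMT / is_physical_core() filtering is left out, and it simply collects the indices into a vector. Instead of the one-line break, it checks whether strtol consumed any digits, so a trailing newline or other unparsable data ends the loop and the pointer is never moved past the terminating '\0'.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical standalone helper (an assumption for illustration, not
// Torque's numa_node::parse_cpu_string()). Expands a cpulist-style string
// such as "0-3,8" into individual CPU indices.
std::vector<long> parse_cpu_list(const char *str)
  {
  std::vector<long>  cpus;
  const char        *ptr = str;

  while (*ptr != '\0')
    {
    char *end = nullptr;
    long  prev = std::strtol(ptr, &end, 10);

    // strtol leaves end == ptr when it could not convert anything
    // (e.g. only a trailing newline is left): stop instead of spinning.
    if (end == ptr)
      break;

    ptr = end;

    if (*ptr == '-')
      {
      // Range "a-b": read the upper bound and expand it.
      ptr++;
      long curr = std::strtol(ptr, &end, 10);

      if (end == ptr)
        break; // malformed range, nothing after the '-'

      ptr = end;

      while (prev <= curr)
        cpus.push_back(prev++);
      }
    else
      {
      // Single index.
      cpus.push_back(prev);
      }

    // Step over a separator only; never step over the terminating '\0'.
    if (*ptr == ',')
      ptr++;
    }

  return cpus;
  }
```

With this sketch, parse_cpu_list("0-3,8\n") gives 0, 1, 2, 3, 8, and an empty string gives an empty list.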

Regards,
Peter