[torqueusers] Torque MOM 4.2.6 - found bug (MOM won't start with cpusets configured) + solution.

Johny johny2015 at wp.pl
Tue Nov 26 05:26:43 MST 2013


Hello.

I've found a bug in the newest release of Torque, 4.2.6.

When compiled with options:
./configure --with-default-server=### --with-rcp=/usr/bin/scp 
--enable-cpuset --enable-nvidia-gpus --enable-blcr 
--enable-geometry-requests --enable-unixsockets=no

...it (the MOM client) just won't start; it hangs while reading the files 
/sys/devices/system/nodes/.../cpulist.

The problem is in the function:
src/resmom/linux/numa_node.cpp -> void numa_node::parse_cpu_string()
It contains a loop that parses successive parts of a string (a line of 
text read from the files mentioned above):

LINE 121:
   while (*ptr != '\0')
     {
     prev = strtol(ptr, &ptr, 10);

     if (*ptr == '-')
       {
       ptr++;
       curr = strtol(ptr, &ptr, 10);

       while (prev <= curr)
         {
#ifdef PENABLE_LINUX26_CPUSETS
         if ((MOMConfigUseSMT == 1) ||
             (is_physical_core(prev) == true))
#endif
           {
           this->cpu_indices.push_back(prev);
           this->cpu_avail.push_back(true);
           this->total_cpus++;
           this->available_cpus++;
           }

         prev++;
         }

       if (*ptr == ',')
         ptr++;
       }
     else if ((*ptr == ',') ||
              (*ptr == '\0'))
       {
#ifdef PENABLE_LINUX26_CPUSETS
       if ((MOMConfigUseSMT == 1) ||
           (is_physical_core(prev) == true))
#endif
         {
         this->cpu_indices.push_back(prev);
         this->cpu_avail.push_back(true);
         this->total_cpus++;
         this->available_cpus++;
         }

       ptr++;
       }
     }

This loop steps past the '\0' character that ends the string: it enters 
the second branch (the "else if", because *ptr == '\0') and then 
increments the pointer, which moves it beyond the end of the string.
In the following iterations it makes no progress: there is usually some 
rubbish data after the '\0' that "strtol" cannot parse, so the pointer 
stays where it is and the loop never ends.
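
The "no progress" part can be checked in isolation. Below is a minimal 
sketch, assuming nothing from Torque: consumed_by_strtol() is a 
hypothetical helper that reports how many characters strtol consumed. 
When no digits can be converted, strtol stores the start pointer in its 
end pointer, so the result is 0 and a loop that relies on strtol to 
advance will stall.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper for illustration only; not part of Torque.
// Returns the number of characters strtol consumed from `text`.
long consumed_by_strtol(const char *text)
  {
  assert(text != nullptr);

  char *end = nullptr;
  std::strtol(text, &end, 10);

  // If no conversion was possible, end == text and the result is 0.
  return end - text;
  }
```

For example, consumed_by_strtol("12,3") is 2, while 
consumed_by_strtol(",3") and consumed_by_strtol("xyz") are both 0.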

To resolve this problem I've just added one line.

Previous version:
161: }
162:
163:ptr++;

I've changed it to:
161: }
162:
163: if(*ptr == '\0') break;
164: ptr++;

Now, when it enters the second branch with *ptr == '\0', it saves the 
data about the core and exits the loop.
Maybe there is a more elegant way to do this, but this is simple and it 
works (tested).
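
As one possible shape for a more defensive variant, here is a 
standalone sketch. It is not Torque's numa_node::parse_cpu_string(): 
parse_cpu_list() is a hypothetical free function, it leaves out the 
MOMConfigUseSMT / is_physical_core() filtering, and it assumes the 
input looks like "0-3,8", possibly followed by a newline. It never 
steps over the terminating '\0' and it also stops whenever strtol 
consumes nothing.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical standalone helper; not the Torque implementation.
// Expands a cpulist-style string such as "0-3,8" into single indices.
std::vector<int> parse_cpu_list(const char *text)
  {
  assert(text != nullptr);

  std::vector<int> cpus;
  const char *ptr = text;

  while (*ptr != '\0')
    {
    char *end = nullptr;
    long first = std::strtol(ptr, &end, 10);

    // Nothing consumed (trailing newline, junk): stop instead of spinning.
    if (end == ptr)
      break;

    long last = first;
    ptr = end;

    if (*ptr == '-')
      {
      const char *range_start = ptr + 1;
      last = std::strtol(range_start, &end, 10);

      // Malformed range such as "3-": give up on the rest of the line.
      if (end == range_start)
        break;

      ptr = end;
      }

    for (long cpu = first; cpu <= last; cpu++)
      cpus.push_back(static_cast<int>(cpu));

    // Step over a separator only; never over the terminating '\0'.
    if (*ptr == ',')
      ptr++;
    }

  return cpus;
  }
```

With this sketch, parse_cpu_list("0-3,8\n") yields 0, 1, 2, 3, 8 and 
returns instead of hanging.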

Regards,
Peter.

