[torqueusers] Altix cpusets

Jeroen van den Muyzenberg Jeroen.vandenMuyzenberg at csiro.au
Wed Oct 26 19:07:58 MDT 2005


Hi,

I've had the chance to play on an Altix 3700 before it joins our
existing Altix in production next week and have been experimenting with
using cpusets, with little initial success.

Turns out there were two problems. A cpuset name can be a maximum of 8
characters, and the string (cQueueName in start_exec.c) holding this
name had no room for the terminating null. Also, cQueueName was never
initialised, so it started out holding garbage: strncpy does not append
a null terminator unless it finds one within the first n characters of
the source, and strncat needs a terminated destination to append to.

We also intend to start using bootcpusets, and the existing code doesn't
account for that, i.e. it will start placing jobs from CPU 0 onwards
even though that CPU may already belong to another cpuset.

Attached is a patch that addresses all these issues. For bootcpuset support,
there needs to be a define in pbs_config.h:

#define CPUSETS_FIRST_CPU X

where X is the first CPU outside the defined bootcpuset.

Looking forward to seeing this work in production next week.

Further improvements would be the ability to specify the type of memory
access regime for the cpuset, and a better cpu allocation algorithm that
would try to pack multi-cpu jobs onto the same node/brick if at all
possible.

Cheers,
Jeroen

Jeroen van den Muyzenberg
CSIRO High Performance Scientific Computing
Bureau of Meteorology/CSIRO HPCCC -
High Performance Computing and Communications Centre
Ph: +61 3 9669 8111 Fax: +61 3 9669 8112
Jeroen.vandenMuyzenberg at csiro.au
-------------- next part --------------
--- torque-2.0.0p0/src/resmom/start_exec.c.orig	2005-10-27 10:06:06.000000000 +1000
+++ torque-2.0.0p0/src/resmom/start_exec.c	2005-10-27 10:53:45.000000000 +1000
@@ -1347,7 +1347,7 @@
 
 #if defined(PENABLE_DYNAMIC_CPUSETS)
 
-  char                  cQueueName[8];  /* Unique CpuSet Name */
+  char                  cQueueName[9];  /* Unique CpuSet Name */
   char                  cPermFile[1024]; /* Unique File Name */
   FILE                  *fp;            /* file pointer into /proc/cpuinfo */
   char                  cBuffer[CBUFFERSIZE + 1];  /* char buffer used for counting procs */
@@ -1356,7 +1356,7 @@
 
   struct CpuSetMap {
     short CpuId;
-    char  cQueueName[8]; /* Data struct for mapping CpuId to
+    char  cQueueName[9]; /* Data struct for mapping CpuId to
                             CpuSets assignments for the machine */
   } *cpusetMap;
 
@@ -1627,6 +1627,7 @@
           /* CpuSet Name = cpusetList->list[i] */
 
           strncpy(cpusetMap[cpuList->list[j]].cQueueName,cpusetList->list[i],8);
+          cpusetMap[cpuList->list[j]].cQueueName[8] = 0;
           }
         }    /* END for (i) */
       }      /* END else */
@@ -1681,7 +1682,9 @@
     /* Queue Name can only be 3 - 8 chars long */
 
     strncpy(cQueueName,pwdp->pw_name,3);
+    cQueueName[3] = 0;
     strncat(cQueueName,pjob->ji_qs.ji_jobid,5);
+    cQueueName[8] = 0;
 
     /* Set Memory Affinity */
 
@@ -1714,7 +1717,11 @@
 
     j = 0;
 
+#ifdef CPUSETS_FIRST_CPU
+    for (i = CPUSETS_FIRST_CPU;i < nCPUS;i++) 
+#else
     for (i = 0;i < nCPUS;i++) 
+#endif
       {
       if (j >= presc->rs_value.at_val.at_long)
         break;

