[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely

Gareth Williams wil240 at csiro.au
Sun Aug 26 22:11:33 MDT 2007


Hi Leigh,

As discussed offline, I have a patch for resmom/linux/mom_mach.c 
(attached) to make it look only for processes in cpusets, rather than 
trawling repeatedly through /proc.  This should suit you with a little 
tweaking, given that you have your own torque_submitfilter cpusets scheme.

For the list, note that this will only work for a single-node setup, as 
torque_submitfilter can only easily influence the environment of the 
job script, whereas in a multi-node environment, tasks started on other 
nodes with tm_spawn cannot be as easily placed in a job-temporary cpuset. 
I think a more general fix needs a revised integration of cpusets into 
pbs_mom/tm_spawn.

cheers,

Gareth Williams, CSIRO HPSC

On Thu, 23 Aug 2007, Leigh Gordon wrote:

> Hi everyone,
>
> This problem has just manifested itself again, in a different way. Users' jobs
> are actually being queued, and then being killed, rather than sitting in the
> BLOCKED queue as before.

-snip-

> Even when this finishes and the strace goes back to a sane speed (just
> reading /proc/loadavg every now and then), the node reports as down for a
> while, then comes back up, only to be taken down by MOM again!
>
> Is there ANY way to stop the system from crawling to a halt and killing jobs
> as soon as they are submitted? Even if people's jobs are queued and checked
> less frequently, it would be preferable to jobs being killed for no
> reason and having to resubmit them hoping that it will work this
> time! Frustrating, as you can imagine :)
>
> It seems that even when there are sufficient CPUs free, by the time it gets
> around to scheduling the job, it's back to its "/proc trawling" again, so
> it can't communicate and the job either enters the queue and then dies, or
> remains in a deferred state.
>
> Any ideas? I realise a proper fix involves rewriting the program for better
> scalability, but does anyone have any ideas to alleviate this problem in the
> meantime? Maybe suggest which values in the MOM/maui/pbs_server config can be
> tweaked to lighten the load?
>
> I think you were right on the money, David; this /proc business is a bit
> ridiculous, and it's ironic that something with this much grunt can be slowed
> to a crawl by a single process!
>
> Thanks
> Leigh
>
> -- 
> Regards,
>
> Leigh Gordon
> High Performance Computing Systems Administrator
> IT Resources, University of Tasmania
> Phone: 03 6226 6389
> http://www.tpac.org.au
>
> "What we're really after is simply that people acquire a legal license for
> Windows for each computer they own before they move on to Linux or Sun
> Solaris or BSD or OS/2 or whatever."
> - Bill Gates -
>
> On Mon, 20 Aug 2007 08:54:55 am Leigh Gordon wrote:
>> Hi, thanks for the reply.
>>
>> I've looked into it a bit more (even going as far as looking at the source
>> code for 'top', which unfortunately also uses /proc rather than system
>> calls) and it seems like a job for someone with a lot more experience
>> with C than myself. I did find a GPL project called Supermon, which I read
>> a paper on (it specifically said it used system calls rather than /proc
>> on Linux), which may hold some clues on how to do it.
>>
>> Decreasing the MOM polling frequency to 5 minutes (was 3 minutes) seems
>> to at least be preventing the jobs from becoming blocked for good (the jobs
>> that get marked "Deferred" return to the "Idle" status, rather than
>> "BatchHold"), although it doesn't prevent the system having its little
>> fits every few minutes.
>>
>> Thanks again,
>> Leigh
>>
>> On Thursday 16 August 2007 06:43:45 David Singleton wrote:
>>> Hi Leigh,
>>>
>>> I'm guessing the problem is the way torque MOMs trawl /proc on
>>> Linux causing the MOM to spend a lot of time very busy.  Have
>>> you strace'd the MOM? Is your MOM running in the same cpuset
>>> as jobs or otherwise fighting for cpu time?
>>>
>>> I haven't looked at this for a while but it used to be that the MOM
>>> would run through the whole of /proc multiple times per job (in
>>> cput_sum(), overcpu_proc(), mem_sum(), ....).  Reading /proc on a
>>> full 128P system is not cheap and if most of your jobs are single
>>> cpu then your multiplier of this number is large.  The cost is about
>>> 64^2 times what it is on a cluster of 2P nodes.
>>>
>>> There are multiple ways this can be improved.  On Solaris and
>>> Digital Unix, MOMs reuse a single trawling of /proc for multiple jobs.
>>> Since most uses are sampling for resource usage/limits, it doesn't
>>> matter if the sampled /proc data is a few millisecs stale.  [On top
>>> of this, the binary blob dump of /proc entries on Solaris and Tru64
>>> is noticeably *faster* than Linux's annoying "readable" format so
>>> MOM on Linux takes a double hit in comparison.]
>>>
>>> Another alternative is to use cpusets or SGI "process aggregates"
>>> (as used in SGI array sessions or their "job" module) to have the
>>> kernel accurately track the list of processes in a job (actually in
>>> each task of a job).  Then, for each job, MOM can just loop over the
>>> list of pids in these job "containers" task lists instead of the
>>> whole of /proc.  [Good news for Linux: a generic "process container"
>>> infrastructure is hopefully being added to the kernel. It effectively
>>> provides an indelible session id and a means of accessing job
>>> task lists.]
>>>
>>> There is benefit in applying both these modifications.
>>>
>>> I could be totally wrong about what is causing your system grief but
>>> it does sound like a MOM scalability issue. If so, fixing it will
>>> require some development work.
>>>
>>> Cheers,
>>> David
>
-------------- next part --------------
--- src/resmom/linux/mom_mach.c	2006-07-13 01:25:19.000000000 +1000
+++ /cs/datastore/csssg/wil240/torque-2.1.2-snap.200607191251/src/resmom/linux/mom_mach.c	2007-02-22 13:42:29.345161288 +1100
@@ -232,6 +232,14 @@
 
 unsigned linux_time = 0;
 
+const char *cpusetroot[] = { /* places where cpuset may be */
+  "/dev/cpuset/torque/acsm/",
+  "/dev/cpuset/torque/scsm-overload/",
+  "/dev/cpuset/boot/"};
+const int                    cpusetroots = sizeof(cpusetroot)/sizeof(cpusetroot[0]);
+extern void                  rmnl( char * ); /* in mom_main */
+extern char                  *skipwhite( char * ); /* in mom_main */
+
 /*
  * support routine for getting system time -- sets linux_time
  */
@@ -713,8 +721,60 @@
   int            nps = 0;
   proc_stat_t   *ps;
 
+  int            i;
+  static char           path[1024];
+  FILE                 *fd = NULL; /* stays NULL if no cpuset tasks file opens */
+  char                   line[120];
+  char                  *str;
+
   cputime = 0;
 
+  for (i = 0;i < cpusetroots;i++)
+    {
+    snprintf(path,sizeof(path),"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+    if ((fd = fopen(path,"r")) != NULL)
+      break;
+    }
+  if (fd)
+    {
+    while (fgets(line,sizeof(line),fd))
+      {
+      str = skipwhite(line);      /* pass over initial whitespace */
+      rmnl(str);
+      if (!isdigit(str[0]))
+        continue;
+      if ((ps = get_proc_stat(atoi(str))) == NULL)  
+        {
+        if (errno != ENOENT) 
+          {
+          sprintf(log_buffer,"%s: get_proc_stat", str);
+          log_err(errno,id,log_buffer);
+          }
+        continue;
+        }
+      nps++;
+      cputime += (ps->utime + ps->stime + ps->cutime + ps->cstime);
+      if (LOGLEVEL >= 6)
+        {
+        sprintf(log_buffer,"%s, cpuset: session=%d pid=%d cputime=%lu (cputfactor=%f)",
+          id, 
+          ps->session, 
+          ps->pid, 
+          cputime,
+          cputfactor);
+        log_record(PBSEVENT_SYSTEM,0,id,log_buffer);
+        }
+      }
+    if (nps == 0)
+      pjob->ji_flags |= MOM_NO_PROC;
+    else
+      pjob->ji_flags &= ~MOM_NO_PROC;
+    fclose(fd);
+    return((unsigned long)((double)cputime * cputfactor));
+    }
+
+  return((unsigned long)((double) 0 )); /* no cpuset tasks file: old /proc scan below is bypassed */
+
   rewinddir(pdir);
 
   while ((dent = readdir(pdir)) != NULL) 
@@ -843,8 +903,45 @@
   unsigned long		segadd;
   proc_stat_t		*ps;
 
+  int            i;
+  static char           path[1024];
+  FILE                 *fd = NULL; /* stays NULL if no cpuset tasks file opens */
+  char                   line[120];
+  char                  *str;
+
   segadd = 0;
 
+  for (i = 0;i < cpusetroots;i++)
+    {
+    snprintf(path,sizeof(path),"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+    if ((fd = fopen(path,"r")) != NULL)
+      break;
+    }
+  if (fd)
+    {
+    while (fgets(line,sizeof(line),fd))
+      {
+      str = skipwhite(line);      /* pass over initial whitespace */
+      rmnl(str);
+      if (!isdigit(str[0]))
+        continue;
+      if ((ps = get_proc_stat(atoi(str))) == NULL)  
+        {
+        if (errno != ENOENT) 
+          {
+          sprintf(log_buffer,"%s: get_proc_stat", str);
+          log_err(errno,id,log_buffer);
+          }
+        continue;
+        }
+      segadd += ps->vsize;
+      }
+    fclose(fd);
+    return(segadd);
+    }
+
+  return((unsigned long) 0 ); /* no cpuset tasks file: old /proc scan below is bypassed */
+
   rewinddir(pdir);
 
   while ((dent = readdir(pdir)) != NULL) 
@@ -892,8 +989,45 @@
   struct dirent	*dent;
   proc_stat_t	*ps;
 
+  int            i;
+  static char           path[1024];
+  FILE                 *fd = NULL; /* stays NULL if no cpuset tasks file opens */
+  char                   line[120];
+  char                  *str;
+
   resisize = 0;
 
+  for (i = 0;i < cpusetroots;i++)
+    {
+    snprintf(path,sizeof(path),"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+    if ((fd = fopen(path,"r")) != NULL)
+      break;
+    }
+  if (fd)
+    {
+    while (fgets(line,sizeof(line),fd))
+      {
+      str = skipwhite(line);      /* pass over initial whitespace */
+      rmnl(str);
+      if (!isdigit(str[0]))
+        continue;
+      if ((ps = get_proc_stat(atoi(str))) == NULL)  
+        {
+        if (errno != ENOENT) 
+          {
+          sprintf(log_buffer,"%s: get_proc_stat", str);
+          log_err(errno,id,log_buffer);
+          }
+        continue;
+        }
+      resisize += ps->rss * pagesize;
+      }
+    fclose(fd);
+    return(resisize);
+    }
+
+  return((unsigned long) 0 ); /* no cpuset tasks file: old /proc scan below is bypassed */
+
   rewinddir(pdir);
 
   while ((dent = readdir(pdir)) != NULL) 
@@ -1807,6 +1941,15 @@
   int            sesid;
   pid_t          mompid;
 
+/****************/
+  int            i;
+  static char           path[1024];
+  FILE                 *fd = NULL; /* stays NULL if no cpuset tasks file opens */
+  char                   line[120];
+  char                  *str;
+/***************/
+
+
   sesid = ptask->ti_qs.ti_sid;
   mompid = getpid();
 
@@ -1841,6 +1984,143 @@
       log_buffer);
     }
 
+
+/****************/
+  for (i = 0;i < cpusetroots;i++)
+    {
+/*    sprintf(path,"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid); */
+    snprintf(path,sizeof(path),"%s%s/tasks", cpusetroot[i], ptask->ti_job->ji_qs.ji_jobid);
+    if ((fd = fopen(path,"r")) != NULL)
+      break;
+    }
+  if (fd)
+    {
+    while (fgets(line,sizeof(line),fd))
+      {
+      str = skipwhite(line);      /* pass over initial whitespace */
+      rmnl(str);
+      if (!isdigit(str[0]))
+        continue;
+
+    if ((ps = get_proc_stat(atoi(str))) == NULL) 
+      {
+      if (errno != ENOENT) 
+        {
+        sprintf(log_buffer,"%s: get_proc_stat", 
+          str);
+
+        log_err(errno,id,log_buffer);
+        }
+ 
+      continue;
+      }
+
+/*    if (sesid == ps->session)  */
+    if (1) 
+      {
+      if ((ps->state == 'Z') || (ps->pid == 0))
+        {
+        /*
+         * Killing a zombie is sure death! Its pid is zero,
+         * which to kill(2) means 'every process in the process
+         * group of the current process'.
+         */
+
+        sprintf(log_buffer,"%s: not killing pid 0 with sig %d",
+          id,
+          sig);
+
+        log_record(
+          PBSEVENT_JOB,
+          PBS_EVENTCLASS_JOB,
+          ptask->ti_job->ji_qs.ji_jobid,
+          log_buffer);
+        }
+      else
+        {
+        int i = 0;
+
+        if (ps->pid == mompid)
+          {
+          /*
+	   * there is a race condition with newly started jobs that
+           * can be killed before they've established their own
+           * session id.  This means the child tasks still have MOM's
+           * session id.  We check this to make sure MOM doesn't kill
+           * herself. 
+           */
+
+          continue;
+          }
+
+        if (sig == SIGKILL) 
+          {
+          struct timespec req;
+
+          req.tv_sec = 0;
+          req.tv_nsec = 250000000;  /* .25 seconds */
+
+          /* give the process some time to quit gracefully first (up to 5 seconds) */
+
+          if (pg == 0)
+            kill(ps->pid,SIGTERM);
+          else
+            killpg(ps->pid,SIGTERM);
+
+          for (i = 0;i < 20;i++) 
+            {
+            /* check if process is gone */
+
+            if (kill(ps->pid,0) == -1) 
+              break;
+
+            nanosleep(&req,NULL);
+            }  /* END for (i = 0) */
+          }    /* END if (sig == SIGKILL) */
+        else
+          {
+          i = 20;
+          }
+
+        sprintf(log_buffer,"%s: killing pid %d task %d with sig %d",
+          id, 
+          ps->pid, 
+          ptask->ti_qs.ti_task, 
+          sig);
+
+        log_record(
+          PBSEVENT_JOB,
+          PBS_EVENTCLASS_JOB,
+          ptask->ti_job->ji_qs.ji_jobid,
+          log_buffer);
+
+        if (i >= 20)
+          {
+          /* kill process hard */
+
+          /* should this be replaced w/killpg() to kill all children? */
+
+          if (pg == 0)
+            kill(ps->pid,sig);
+          else
+            killpg(ps->pid,sig);
+          }
+
+        ++ct;
+        }  /* END else ((ps->state == 'Z') || (ps->pid == 0)) */
+      }    /* END if (1) -- session check disabled; cpuset defines membership */
+    }      /* END while (fgets(line,sizeof(line),fd)) */
+
+  /* SUCCESS */
+
+    fclose(fd);
+    return(ct);
+    }
+
+  return(0); /* no cpuset tasks file: old /proc scan below is bypassed */
+
+/**********************/
+
   /* pdir is global */
 
   rewinddir(pdir);
@@ -2586,6 +2866,9 @@
   static	int	maxjid = 200;
   register pid_t jobid;
 
+/* avoid overhead */
+  return("not_checked");
+
   if (attrib != NULL) 
     {
     log_err(-1,id,extra_parm);
@@ -2874,6 +3157,7 @@
   uid_t		*uids, *hold;
   static	int	maxuid = 200;
   register	uid_t	uid;
+  return("not_checked");
 
   if (attrib) 
     {

