[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely
Gareth Williams
wil240 at csiro.au
Sun Aug 26 22:11:33 MDT 2007
Hi Leigh,
As discussed offline, I have a patch for resmom/linux/mom_mach.c
(attached) to make it only look for processes in cpusets, rather than
trawling repeatedly through /proc. This should suit you with a little
tweaking, given that you have your own torque_submitfilter cpusets scheme.
For the list, note that this will only work for a single-node setup, as
torque_submitfilter can only easily influence the environment of the
jobscript, whereas in a multi-node environment, tasks started on other
nodes with tm_spawn cannot be as easily placed in a job-temporary cpuset.
I think for a more general fix, a revised integration of cpusets into
pbs_mom/tm_spawn is needed.
cheers,
Gareth Williams, CSIRO HPSC
On Thu, 23 Aug 2007, Leigh Gordon wrote:
> Hi everyone,
>
> This problem has just manifested itself again, in a different way. Users' jobs
> are actually being queued, and then being killed, rather than sitting in the
> BLOCKED queue as before.
-snip-
> Even when this finishes and the strace goes back to a sane speed (just
> reading /proc/loadavg every now and then), the node reports as down for a
> while, then comes back up, only to be taken down by MOM again!
>
> Is there ANY way to stop the system from crawling to a halt and killing jobs
> as soon as they are submitted? Even if people's jobs are queued and checked
> less frequently, it would be preferable to the jobs being killed for no
> reason and having to try and resubmit them hoping that it will work this
> time! Frustrating as you can imagine :)
>
> It seems that even when there are sufficient CPUs free, by the time it gets
> around to scheduling the job, it's back to its "/proc trawling" again so
> then can't communicate and the job either enters the queue and then dies, or
> remains in a deferred state.
>
> Any ideas? I realise a proper fix involves rewriting the program for better
> scalability, but does anyone have any ideas to alleviate this problem in the
> meantime? Maybe suggest which values in MOM/maui/pbs_server config can be
> tweaked to lighten the load?
>
> I think you were right on the money David, this /proc business is a bit
> ridiculous and it's ironic that something with this much grunt can be slowed
> to a crawl by a single process!
>
> Thanks
> Leigh
>
> --
> Regards,
>
> Leigh Gordon
> High Performance Computing Systems Administrator
> IT Resources, University of Tasmania
> Phone: 03 6226 6389
> http://www.tpac.org.au
>
> "What we're really after is simply that people acquire a legal license for
> Windows for each computer they own before they move on to Linux or Sun
> Solaris or BSD or OS/2 or whatever."
> - Bill Gates -
>
> On Mon, 20 Aug 2007 08:54:55 am Leigh Gordon wrote:
>> Hi, thanks for the reply.
>>
>> I've looked into it a bit more (even going as far as looking at the source
>> code for 'top', which unfortunately also uses "/proc" rather than system
>> calls) and it seems like it's a job for someone with a lot more experience
>> with C than myself. I did find a GPL project called Supermon, which I read
>> a paper on (and it specifically said it used system calls rather than /proc
>> on Linux) which may hold some clues on how to do it.
>>
>> Decreasing the mom polling frequency to every 5 minutes (was every 3
>> minutes) seems to at least be preventing the jobs from becoming blocked for
>> good (the jobs that get marked "Deferred" return to the "Idle" status,
>> rather than "BatchHold"), although it doesn't prevent the system having its
>> little fits every few minutes.
>>
>> Thanks again,
>> Leigh
>>
>> On Thursday 16 August 2007 06:43:45 David Singleton wrote:
>>> Hi Leigh,
>>>
>>> I'm guessing the problem is the way torque MOMs trawl /proc on
>>> Linux causing the MOM to spend a lot of time very busy. Have
>>> you strace'd the MOM? Is your MOM running in the same cpuset
>>> as jobs or otherwise fighting for cpu time?
>>>
>>> I haven't looked at this for a while but it used to be that the MOM
>>> would run through the whole of /proc multiple times per job (in
>>> cput_sum(), overcpu_proc(), mem_sum(), ....). Reading /proc on a
>>> full 128P system is not cheap and if most of your jobs are single
>>> cpu then your multiplier of this number is large. The cost is about
>>> 64^2 times what it is on a cluster of 2P nodes.
>>>
>>> There are multiple ways this can be improved. On solaris and
>>> digitalunix, MOMs reuse a single trawling of /proc for multiple jobs.
>>> Since most uses are sampling for resource usage/limits, it doesn't
>>> matter if the sampled /proc data is a few millisecs stale. [On top
>>> of this, the binary blob dump of /proc entries on Solaris and Tru64
>>> is noticeably *faster* than Linux's annoying "readable" format so
>>> MOM on Linux takes a double hit in comparison.]
>>>
>>> Another alternative is to use cpusets or SGI "process aggregates"
>>> (as used in SGI array sessions or their "job" module) to have the
>>> kernel accurately track the list of processes in a job (actually in
>>> each task of a job). Then, for each job, MOM can just loop over the
>>> list of pids in these job "containers" task lists instead of the
>>> whole of /proc. [Good news for Linux: a generic "process container"
>>> infrastructure is hopefully being added to the kernel. It effectively
>>> provides an indelible session id and a means of accessing job
>>> task lists.]
>>>
>>> There is benefit in applying both these modifications.
>>>
>>> I could be totally wrong about what is causing your system grief but
>>> it does sound like a MOM scalability issue. If so, fixing it will
>>> require some development work.
>>>
>>> Cheers,
>>> David
>
>
>
>
>
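
David's first suggestion above (reuse a single trawl of /proc for multiple
jobs) could be sketched roughly as follows. This is a sketch under stated
assumptions, not TORQUE code: proc_sample, snapshot_proc and cput_for_session
are hypothetical names, and the field positions are those documented for
/proc/[pid]/stat on Linux. The directory is walked once into an in-memory
snapshot, and each job's CPU time is then summed from that snapshot by
session id instead of re-reading /proc per job.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* One sampled process; a hypothetical record, not TORQUE's proc_stat_t. */
typedef struct {
    pid_t pid;
    pid_t session;
    unsigned long cputime; /* utime + stime + cutime + cstime, in clock ticks */
} proc_sample;

/*
 * Walk proc_root (normally "/proc") once and fill snap[].
 * Returns the number of samples stored, or -1 if the directory cannot be opened.
 */
int snapshot_proc(const char *proc_root, proc_sample *snap, int max)
{
    DIR *dir = opendir(proc_root);
    struct dirent *de;
    int n = 0;

    if (dir == NULL)
        return -1;

    while (n < max && (de = readdir(dir)) != NULL) {
        char path[512], buf[1024], *p;
        unsigned long ut, st;
        long cut, cst;
        int sid;
        FILE *fp;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue; /* not a pid directory */

        snprintf(path, sizeof(path), "%s/%s/stat", proc_root, de->d_name);
        if ((fp = fopen(path, "r")) == NULL)
            continue; /* process exited between readdir and fopen */
        p = fgets(buf, sizeof(buf), fp);
        fclose(fp);

        /* the command name may contain spaces, so parse after the last ')' */
        if (p == NULL || (p = strrchr(buf, ')')) == NULL)
            continue;

        /* fields after comm: state ppid pgrp session tty_nr tpgid flags
         * minflt cminflt majflt cmajflt utime stime cutime cstime ... */
        if (sscanf(p + 1,
                   " %*c %*d %*d %d %*d %*d %*u %*u %*u %*u %*u %lu %lu %ld %ld",
                   &sid, &ut, &st, &cut, &cst) != 5)
            continue;

        snap[n].pid = (pid_t)atol(de->d_name);
        snap[n].session = (pid_t)sid;
        snap[n].cputime = ut + st + (unsigned long)cut + (unsigned long)cst;
        n++;
    }

    closedir(dir);
    return n;
}

/* Sum CPU time for one session from an existing snapshot; no /proc access. */
unsigned long cput_for_session(const proc_sample *snap, int n, pid_t sid)
{
    unsigned long total = 0;
    int i;

    for (i = 0; i < n; i++)
        if (snap[i].session == sid)
            total += snap[i].cputime;
    return total;
}
```

With this shape, a polling cycle would call snapshot_proc("/proc", ...) once
and then cput_for_session() once per job, so the expensive directory walk no
longer multiplies with the number of jobs on the node.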
-------------- next part --------------
--- src/resmom/linux/mom_mach.c 2006-07-13 01:25:19.000000000 +1000
+++ /cs/datastore/csssg/wil240/torque-2.1.2-snap.200607191251/src/resmom/linux/mom_mach.c 2007-02-22 13:42:29.345161288 +1100
@@ -232,6 +232,14 @@
unsigned linux_time = 0;
+const char *cpusetroot[] = { /* places where cpuset may be */
+ "/dev/cpuset/torque/acsm/",
+ "/dev/cpuset/torque/scsm-overload/",
+ "/dev/cpuset/boot/"};
+const int cpusetroots = 3;
+extern void rmnl( char * ); /* in mom_main */
+extern char *skipwhite( char * ); /* in mom_main */
+
/*
* support routine for getting system time -- sets linux_time
*/
@@ -713,8 +721,60 @@
int nps = 0;
proc_stat_t *ps;
+ int i;
+ static char path[1024];
+ FILE *fd;
+ char line[120];
+ char *str;
+
cputime = 0;
+ for (i = 0;i < cpusetroots;i++)
+ {
+ sprintf(path,"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+ if (fd = fopen(path,"r"))
+ break;
+ }
+ if (fd)
+ {
+ while (fgets(line,sizeof(line),fd))
+ {
+ str = skipwhite(line); /* pass over initial whitespace */
+ rmnl(str);
+ if (!isdigit(str[0]))
+ continue;
+ if ((ps = get_proc_stat(atoi(str))) == NULL)
+ {
+ if (errno != ENOENT)
+ {
+ sprintf(log_buffer,"%s: get_proc_stat", str);
+ log_err(errno,id,log_buffer);
+ }
+ continue;
+ }
+ nps++;
+ cputime += (ps->utime + ps->stime + ps->cutime + ps->cstime);
+ if (LOGLEVEL >= 6)
+ {
+ sprintf(log_buffer,"%s, cpuset: session=%d pid=%d cputime=%lu (cputfactor=%f)",
+ id,
+ ps->session,
+ ps->pid,
+ cputime,
+ cputfactor);
+ log_record(PBSEVENT_SYSTEM,0,id,log_buffer);
+ }
+ }
+ if (nps == 0)
+ pjob->ji_flags |= MOM_NO_PROC;
+ else
+ pjob->ji_flags &= ~MOM_NO_PROC;
+ fclose(fd);
+ return((unsigned long)((double)cputime * cputfactor));
+ }
+
+ return((unsigned long)((double) 0 ));
+
rewinddir(pdir);
while ((dent = readdir(pdir)) != NULL)
@@ -843,8 +903,45 @@
unsigned long segadd;
proc_stat_t *ps;
+ int i;
+ static char path[1024];
+ FILE *fd;
+ char line[120];
+ char *str;
+
segadd = 0;
+ for (i = 0;i < cpusetroots;i++)
+ {
+ sprintf(path,"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+ if (fd = fopen(path,"r"))
+ break;
+ }
+ if (fd)
+ {
+ while (fgets(line,sizeof(line),fd))
+ {
+ str = skipwhite(line); /* pass over initial whitespace */
+ rmnl(str);
+ if (!isdigit(str[0]))
+ continue;
+ if ((ps = get_proc_stat(atoi(str))) == NULL)
+ {
+ if (errno != ENOENT)
+ {
+ sprintf(log_buffer,"%s: get_proc_stat", str);
+ log_err(errno,id,log_buffer);
+ }
+ continue;
+ }
+ segadd += ps->vsize;
+ }
+ fclose(fd);
+ return(segadd);
+ }
+
+ return((unsigned long) 0 );
+
rewinddir(pdir);
while ((dent = readdir(pdir)) != NULL)
@@ -892,8 +989,45 @@
struct dirent *dent;
proc_stat_t *ps;
+ int i;
+ static char path[1024];
+ FILE *fd;
+ char line[120];
+ char *str;
+
resisize = 0;
+ for (i = 0;i < cpusetroots;i++)
+ {
+ sprintf(path,"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid);
+ if (fd = fopen(path,"r"))
+ break;
+ }
+ if (fd)
+ {
+ while (fgets(line,sizeof(line),fd))
+ {
+ str = skipwhite(line); /* pass over initial whitespace */
+ rmnl(str);
+ if (!isdigit(str[0]))
+ continue;
+ if ((ps = get_proc_stat(atoi(str))) == NULL)
+ {
+ if (errno != ENOENT)
+ {
+ sprintf(log_buffer,"%s: get_proc_stat", str);
+ log_err(errno,id,log_buffer);
+ }
+ continue;
+ }
+ resisize += ps->rss * pagesize;
+ }
+ fclose(fd);
+ return(resisize);
+ }
+
+ return((unsigned long) 0 );
+
rewinddir(pdir);
while ((dent = readdir(pdir)) != NULL)
@@ -1807,6 +1941,15 @@
int sesid;
pid_t mompid;
+/****************/
+ int i;
+ static char path[1024];
+ FILE *fd;
+ char line[120];
+ char *str;
+/***************/
+
+
sesid = ptask->ti_qs.ti_sid;
mompid = getpid();
@@ -1841,6 +1984,143 @@
log_buffer);
}
+
+/****************/
+ for (i = 0;i < cpusetroots;i++)
+ {
+/* sprintf(path,"%s%s/tasks", cpusetroot[i], pjob->ji_qs.ji_jobid); */
+ sprintf(path,"%s%s/tasks", cpusetroot[i], ptask->ti_job->ji_qs.ji_jobid);
+ if (fd = fopen(path,"r"))
+ break;
+ }
+ if (fd)
+ {
+ while (fgets(line,sizeof(line),fd))
+ {
+ str = skipwhite(line); /* pass over initial whitespace */
+ rmnl(str);
+ if (!isdigit(str[0]))
+ continue;
+
+ if ((ps = get_proc_stat(atoi(str))) == NULL)
+ {
+ if (errno != ENOENT)
+ {
+ sprintf(log_buffer,"%s: get_proc_stat",
+ str);
+
+ log_err(errno,id,log_buffer);
+ }
+
+ continue;
+ }
+
+/* if (sesid == ps->session) */
+ if (1)
+ {
+ if ((ps->state == 'Z') || (ps->pid == 0))
+ {
+ /*
+ * Killing a zombie is sure death! Its pid is zero,
+ * which to kill(2) means 'every process in the process
+ * group of the current process'.
+ */
+
+ sprintf(log_buffer,"%s: not killing pid 0 with sig %d",
+ id,
+ sig);
+
+ log_record(
+ PBSEVENT_JOB,
+ PBS_EVENTCLASS_JOB,
+ ptask->ti_job->ji_qs.ji_jobid,
+ log_buffer);
+ }
+ else
+ {
+ int i = 0;
+
+ if (ps->pid == mompid)
+ {
+ /*
+ * there is a race condition with newly started jobs that
+ * can be killed before they've established their own
+ * session id. This means the child tasks still have MOM's
+ * session id. We check this to make sure MOM doesn't kill
+ * herself.
+ */
+
+ continue;
+ }
+
+ if (sig == SIGKILL)
+ {
+ struct timespec req;
+
+ req.tv_sec = 0;
+ req.tv_nsec = 250000000; /* .25 seconds */
+
+ /* give the process some time to quit gracefully first (up to 5 seconds) */
+
+ if (pg == 0)
+ kill(ps->pid,SIGTERM);
+ else
+ killpg(ps->pid,SIGTERM);
+
+ for (i = 0;i < 20;i++)
+ {
+ /* check if process is gone */
+
+ if (kill(ps->pid,0) == -1)
+ break;
+
+ nanosleep(&req,NULL);
+ } /* END for (i = 0) */
+ } /* END if (sig == SIGKILL) */
+ else
+ {
+ i = 20;
+ }
+
+ sprintf(log_buffer,"%s: killing pid %d task %d with sig %d",
+ id,
+ ps->pid,
+ ptask->ti_qs.ti_task,
+ sig);
+
+ log_record(
+ PBSEVENT_JOB,
+ PBS_EVENTCLASS_JOB,
+ ptask->ti_job->ji_qs.ji_jobid,
+ log_buffer);
+
+ if (i >= 20)
+ {
+ /* kill process hard */
+
+ /* should this be replaced w/killpg() to kill all children? */
+
+ if (pg == 0)
+ kill(ps->pid,sig);
+ else
+ killpg(ps->pid,sig);
+ }
+
+ ++ct;
+ } /* END else ((ps->state == 'Z') || (ps->pid == 0)) */
+ } /* END if (1), formerly if (sesid == ps->session) */
+ } /* END while (fgets(line,sizeof(line),fd)) */
+
+ /* SUCCESS */
+
+ fclose(fd);
+ return(ct);
+ }
+
+ return(0);
+
+/**********************/
+
/* pdir is global */
rewinddir(pdir);
@@ -2586,6 +2866,9 @@
static int maxjid = 200;
register pid_t jobid;
+/* avoid overhead */
+ return("not_checked");
+
if (attrib != NULL)
{
log_err(-1,id,extra_parm);
@@ -2874,6 +3157,7 @@
uid_t *uids, *hold;
static int maxuid = 200;
register uid_t uid;
+ return("not_checked");
if (attrib)
{