[Mauiusers] Help with Maui

Tom Rudwick tomr at intrinsity.com
Thu Jun 26 12:53:05 MDT 2008


If you want Maui to run jobs faster, you should look at this patch,
which I submitted to the list a while back. It changes Maui to start
jobs asynchronously, which speeds up scheduling quite a bit. We've
been running about 200,000 jobs/day with it for the last year and a
half (about 100 million jobs in total).

Tom

Jeremy Mann wrote:
> First, Maui has pretty much solved our problems with submitting 100k+
> jobs. It was simply overloading the stock PBS scheduler. But now I'm
> coming into a (maybe) configuration problem.
> 
> I have one user who submits 100k+ jobs. This works fine for the first
> couple thousand jobs, which line up in the queue and begin to execute.
> However, after a certain time, the queue is only visibly filled with 15
> jobs. There are still thousands of jobs hidden. I can see this if I run
> qstat multiple times in a row: the queue is repopulated with 15 more jobs.
> The jobs only run for about 45 seconds, so I'm thinking Maui isn't picking
> up on this?
> 
> My second question is probably more important. We run two PBS environments
> on all of our clusters. One is our 'default' high priority queue and the
> second is the 'default' low priority queue. The low priority queue is for
> jobs that run at nice 19 or 20. So we load up the low priority queue with
> niced jobs and don't care how long they take to finish. This leaves the
> high priority queue to process our own grid and MPI jobs.
> 
> This has worked fine for a while, but now I have a user who wants to run a
> few hundred thousand jobs in our low priority queue (see paragraphs 1 and
> 2). The stock pbs_sched was simply getting overloaded and would crash.
> This is when I set up a test cluster using PBS/Maui and we haven't had a
> problem (other than the 15 queue limit I spoke of before).
> 
> Yesterday, I set up a second Maui to schedule the low priority queue. I
> could submit jobs and check job status, but the jobs would never run.
> When I checked the Maui logs, I found checksum errors, which led me to
> the problem: the $PATH environment picks up the normally installed Maui
> and uses its binaries to perform its functions. It turns out the checksum
> error occurs when the normally installed Maui tries to process and query
> the second low priority queue. I can get around this by using the second
> installed Maui's binaries to query the low priority queue. If there is a
> way to disable this at compilation time, I think my problem will go away.
> 
> I look forward to any comments or questions!
> 
> 

-------------- next part --------------
--- src/moab/MPBSI.c.no_async	2007-06-26 11:35:14.000000000 -0500
+++ src/moab/MPBSI.c	2007-10-10 14:39:51.000000000 -0500
@@ -1904,7 +1904,7 @@
 
       return(FAILURE);
       }
-
+    /*
     if (MPBSJobModify(
           J,
           R,
@@ -1939,6 +1939,7 @@
         J->Name,
         HostList);
       }
+    */
     }
   else
     {
@@ -2017,7 +2018,7 @@
 
   MJobGetName(J,NULL,R,tmpJobName,sizeof(tmpJobName),mjnRMName);       
 
-  rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+  rc = pbs_asyrunjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
 
   if (rc != 0)
     {
@@ -2041,6 +2042,7 @@
     JobStartFailed = TRUE;
     }
 
+  /*
   if (J->NeedNodes != NULL)
     {
     if (MPBSJobModify(
@@ -2062,7 +2064,7 @@
         J->NeedNodes);
       }
     }
-
+  */
   if (JobStartFailed == TRUE)
     {
     /* job could not be started */
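
The patch above swaps pbs_runjob for pbs_asyrunjob and comments out the
MPBSJobModify calls around it, so each job start costs the scheduler fewer
blocking round trips. As a rough illustration of why that matters at 100k+
jobs, here is a toy cost model in C. It is only a sketch: the function names
and the millisecond figures used with it are hypothetical, not measurements
and not part of Maui or TORQUE, and it assumes an asynchronous start returns
once the server has accepted the request rather than once the job is running.

```c
/*
 * Toy cost model for one scheduling pass (hypothetical, for illustration).
 *   request_ms : assumed time to send one run request and get it accepted
 *   confirm_ms : assumed extra time spent waiting for the job to start
 *   wait_for_confirm : 1 models a blocking start, 0 an asynchronous one
 */

/* Cost of starting a single job under the chosen mode, in milliseconds. */
long per_job_ms(long request_ms, long confirm_ms, int wait_for_confirm)
{
    return request_ms + (wait_for_confirm ? confirm_ms : 0);
}

/* Total time for a pass that starts njobs one after another. */
long pass_time_ms(int njobs, long request_ms, long confirm_ms,
                  int wait_for_confirm)
{
    return (long)njobs * per_job_ms(request_ms, confirm_ms, wait_for_confirm);
}

/* Number of jobs that can be started within a fixed time budget. */
long jobs_within_budget(long budget_ms, long request_ms, long confirm_ms,
                        int wait_for_confirm)
{
    long cost = per_job_ms(request_ms, confirm_ms, wait_for_confirm);

    if (cost <= 0)
        return 0;   /* avoid division by zero on degenerate input */

    return budget_ms / cost;
}
```

With made-up costs of 2 ms per request and 48 ms per confirmation,
pass_time_ms(1000, 2, 48, 1) gives 50000 ms for the blocking case and
pass_time_ms(1000, 2, 48, 0) gives 2000 ms for the asynchronous one. The
trade-off is that a start failure is no longer visible in the return code
of the call itself and has to be picked up on a later poll.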

