[Mauiusers] Help with Maui
tomr at intrinsity.com
Thu Jun 26 12:53:05 MDT 2008
If you want Maui to run jobs faster, take a look at this patch,
which I submitted to the list a while back. It changes Maui to start
jobs asynchronously (pbs_asyrunjob instead of pbs_runjob), which speeds
up scheduling quite a bit. We've been running about 200,000 jobs/day
with it for the last year and a half (about 100 million jobs in total).
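
To give a rough feel for why the asynchronous call helps, here is a minimal sketch in C. It does not talk to a PBS server: model_runjob() and model_asyrunjob() are hypothetical stand-ins for pbs_runjob() and pbs_asyrunjob(), and the two cost constants are made-up numbers chosen only for illustration. The assumption being modelled is that the blocking call returns only once the job has been started, while the asynchronous call returns as soon as the server has accepted the request.

```c
#include <stdio.h>

/* Assumed, illustrative costs; not measured on any real cluster. */
#define ACK_COST_MS    2   /* server validates and acknowledges the request */
#define START_COST_MS 50   /* server actually starts the job on its node    */

/* Hypothetical stand-in for pbs_runjob(): the caller waits for the
 * acknowledgement and for the job start before it gets control back. */
static int model_runjob(long *elapsed_ms)
{
    *elapsed_ms += ACK_COST_MS + START_COST_MS;
    return 0;
}

/* Hypothetical stand-in for pbs_asyrunjob(): the caller waits only for
 * the acknowledgement; the job start happens after the call returns. */
static int model_asyrunjob(long *elapsed_ms)
{
    *elapsed_ms += ACK_COST_MS;
    return 0;
}

/* Time the scheduler spends blocked while dispatching njobs jobs.
 * use_async selects which of the two models is used. */
long dispatch_time_ms(int njobs, int use_async)
{
    long elapsed = 0;
    int i;

    for (i = 0; i < njobs; i++) {
        int rc = use_async ? model_asyrunjob(&elapsed)
                           : model_runjob(&elapsed);
        if (rc != 0)
            break;  /* stop dispatching on the first failure */
    }
    return elapsed;
}
```

With these assumed costs, dispatch_time_ms(1000, 0) comes to 52000 ms and dispatch_time_ms(1000, 1) to 2000 ms. The real ratio depends entirely on how long your server takes to start a job, so treat the numbers as a picture of the effect, not a prediction.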
Jeremy Mann wrote:
> First, Maui has pretty much solved our problems with submitting 100k+
> jobs. It was simply overloading the stock PBS scheduler. But now I'm
> coming into a (maybe) configuration problem.
> I have one user who submits 100k+ jobs. This works fine for the first
> couple thousand jobs, which line up in the queue and begin to execute.
> However, after a certain time, the queue is only visibly filled with 15
> jobs. There are still thousands of jobs hidden. I can see this if I run
> qstat multiple times in a row, the queue is repopulated with 15 more jobs.
> The jobs only run for about 45 seconds so I'm thinking Maui isn't picking
> up on this?
> My second question is probably more important. We run two PBS environments
> on all of our clusters. One is our 'default' high priority queue and the
> second is the 'default' low priority queue. The low priority queue is for
> jobs that run at nice 19 or 20. So we load up the low priority queue with
> niced jobs and don't care how long they take to finish. This leaves the
> high priority queue to process our own grid and MPI jobs.
> This has worked fine for a while, but now I have a user who wants to run a
> few hundred thousand jobs in our low priority queue (see paragraph 1 and
> 2). The stock pbs_sched was simply getting overloaded and would crash.
> This is when I set up a test cluster using PBS/Maui and we haven't had a
> problem (other than the 15 queue limit I spoke of before).
> Yesterday, I set up a second Maui to schedule the low priority queue. I
> could submit jobs, check job status, however the jobs would never run.
> This is when I started checking the Maui logs and found checksum errors.
> This is when I discovered the problem. The $PATH environment picks up the
> normally installed Maui and uses its binaries to perform its functions.
> Turns out the checksum error is when the normally installed Maui tries to
> process and query the second low priority queue. I can get around this by
> using the second installed Maui's binaries to query the low priority
> queue. If there is a way to disable this at compilation time, I think my
> problem will go away.
> I look forward to any comments or questions!
-------------- next part --------------
--- src/moab/MPBSI.c.no_async 2007-06-26 11:35:14.000000000 -0500
+++ src/moab/MPBSI.c 2007-10-10 14:39:51.000000000 -0500
@@ -1904,7 +1904,7 @@
@@ -1939,6 +1939,7 @@
@@ -2017,7 +2018,7 @@
- rc = pbs_runjob(R->U.PBS.ServerSD,tmpJobName,MasterHost,NULL);
+ rc = pbs_asyrunjob(R->U.PBS.ServerSD,tmpJobName,HostList,NULL);
if (rc != 0)
@@ -2041,6 +2042,7 @@
JobStartFailed = TRUE;
if (J->NeedNodes != NULL)
@@ -2062,7 +2064,7 @@
if (JobStartFailed == TRUE)
/* job could not be started */