[Mauiusers] insufficient idle procs available ?

Jan Ploski Jan.Ploski at offis.de
Fri Feb 1 09:05:15 MST 2008

"Itay M" <itaym.tau at gmail.com> wrote on 02/01/2008 04:01:26 AM:

> Well, you definitely came up with something interesting. The 
> NODEAVAILABILITYPOLICY looks as if it should help me resolve this
> issue (although so far it hasn't).
> I ran the following tests to figure out what is going on behind the 
> scenes of the cluster:
> 1. I listed all the nodes for which diagnose -n reports "has more 
> processors utilized than dedicated".
> 2. Then I submitted several very short jobs (2 minutes) and 
> targeted each of them at one of the nodes listed above. I
> used -l host={nodename} -l walltime=00:00:02

That looks more like two seconds to me: 00:00:02 reads as HH:MM:SS.
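
To make the point concrete, here is a minimal Python sketch that converts a walltime string to seconds. It assumes the usual [[HH:]MM:]SS form; the function name walltime_to_seconds is made up for this illustration and is not part of TORQUE or Maui.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert a [[HH:]MM:]SS walltime string to a number of seconds."""
    parts = [int(p) for p in walltime.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected walltime format: {walltime!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right.
        seconds = seconds * 60 + part
    return seconds


print(walltime_to_seconds("00:00:02"))  # 2   (what was submitted)
print(walltime_to_seconds("00:02:00"))  # 120 (two minutes)
```

So a two-minute limit would be written as walltime=00:02:00.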

> (The purpose of the walltime 
> is to make sure MAUI will not activate any reservation 
> policies on the jobs. In fact the cluster had many free CPUs at the 
> time I ran the test, so no reservations are expected.) I expected 
> the jobs *not* to go into R state, because each job was 
> targeted at a node that "has more processors utilized than dedicated".
> 3. Indeed that's what happened! None of the jobs went from Q state to
> R state. They waited there for a very long time (hours).
> 4. I then checked the load average on each of the nodes listed 
> above, and indeed found that their load average is higher than 
> their configured resources. For example, if the 'nodes' file says 
> 'node22 np=4', I checked its load average while it had the 
> "has more processors utilized than dedicated" warning. I found that 
> although this node runs only 2 jobs at the moment, its load average is 
> above that (about 2.70). I expect this node to run 4 jobs at the same time.
> > Are these 2 jobs multithreaded? Is the load ~4 while it should be ~2?
> I'm not sure if they are multithreaded (this needs further checking with 
> the developers), but you're right. The load should be no more than
> 2 for 2 jobs, but in fact it's >2. The jobs are C++, compiled with the 
> g++ compiler. Maybe a compilation switch would help reduce the 
> load average to 1 per job?

Ask the programmers. The load average could also be caused by other 
(non-job) activity on the node. Just run top and see how much CPU% each of 
the job executables consumes. For a multithreaded job (and a 2.6.x kernel) 
you will see >100% CPU usage in top.
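
If you would rather script that check than eyeball top, the following Python sketch filters the output of "ps -eo pid,pcpu,nlwp,comm" (procps on Linux) for processes above a CPU threshold. The helper names and the sample rows are invented for illustration. Note also that ps reports CPU usage averaged over the process lifetime, so it is only a rough stand-in for what top shows.

```python
import subprocess


def busy_processes(ps_output: str, threshold: float = 100.0):
    """Return (pid, cpu_percent, threads, command) for rows above threshold."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        pid, pcpu, nlwp, comm = line.split(None, 3)
        if float(pcpu) > threshold:
            rows.append((int(pid), float(pcpu), int(nlwp), comm))
    return rows


def snapshot() -> str:
    """Run ps on the local node and return its raw text output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,nlwp,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Made-up sample in the same column layout, for demonstration only.
SAMPLE = """\
  PID %CPU NLWP COMMAND
 4711  187    2 simjob
 4712 99.5    1 simjob
 4800  0.3    1 sshd
"""

for pid, cpu, threads, command in busy_processes(SAMPLE):
    print(f"{command} (pid {pid}): {cpu:.0f}% CPU, {threads} threads")
```

On a real node you would pass snapshot() instead of SAMPLE. A job that shows well above 100% with more than one thread is a likely candidate for the extra load.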

> I then moved to the next step, and set the NODEAVAILABILITYPOLICY to
> UTILIZED. The showconfig command now says:
> As this didn't make the jobs run, perhaps it's a matter of another 
> tweak to the NODEAVAILABILITYPOLICY?

You can set it to DEDICATED for testing. Then the load average should not 
matter at all in scheduling decisions; only the number of jobs assigned to 
the node should be taken into account. However, you probably don't want to 
keep this setting in the long term, because you then risk overcommitting 
the nodes and decreasing overall performance.
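
As a rough mental model of the three policies (my simplification, not Maui's actual code), you can think of the free processors on a node as the configured count minus whatever the policy counts as "in use". The function name and the exact arithmetic below are assumptions made for illustration.

```python
def available_procs(configured: float, dedicated: float,
                    utilized: float, policy: str) -> float:
    """Simplified model of processor availability under each policy.

    configured: processors from the nodes file (np=...)
    dedicated:  processors assigned to running jobs
    utilized:   processors actually busy (approximated by load average)
    """
    if policy == "DEDICATED":
        in_use = dedicated
    elif policy == "UTILIZED":
        in_use = utilized
    elif policy == "COMBINED":
        # Assumption: the more pessimistic of the two views wins.
        in_use = max(dedicated, utilized)
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return max(0.0, configured - in_use)


# node22 from the report: np=4, 2 running jobs, load average about 2.70
for policy in ("DEDICATED", "UTILIZED", "COMBINED"):
    free = available_procs(4, 2, 2.70, policy)
    print(f"{policy:9s} -> about {free:.2f} processors free")
```

Under this model only DEDICATED ignores the load average, which is why it is useful for isolating the problem.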

> And one more thing about the diagnose -j output: I'm not sure whether
> and how I should treat the 'WARNING:  job '{job_id}' utilizes more 
> memory than dedicated (xxxx > 512)' message. A vmstat test shows that 
> jobs are indeed swapping heavily on the node.

It seems that memory is the bottleneck in your setup. You should make the 
jobs ask for as much memory as they need on average (rather than the 
default 512MB). Of course, with the UTILIZED or COMBINED policy it will 
mean that the idle jobs won't get assigned to a node where such a 
memory-hog job is running. However, that's probably still better for your 
throughput than allowing swapping to happen (much depends on memory access 
patterns in the programs - if they need lots of memory, but seldom access 
all of it, swapping may be acceptable). You could also buy more memory or 
tell people to write more memory-efficient code...
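
To see why realistic memory requests matter, here is a small Python sketch of how many jobs fit on a node once memory is counted alongside processors. The node size (4096 MB, 4 processors) and the 1500 MB job are hypothetical numbers, and jobs_that_fit is a name made up for this example.

```python
def jobs_that_fit(node_mem_mb: int, node_procs: int, job_mem_mb: int) -> int:
    """Jobs a node can hold without swapping, one processor per job."""
    by_memory = node_mem_mb // job_mem_mb
    return min(node_procs, by_memory)


# With the default 512 MB request the scheduler believes 4 jobs fit ...
print(jobs_that_fit(4096, 4, 512))   # 4
# ... but if each job really needs about 1500 MB, only 2 fit without swapping.
print(jobs_that_fit(4096, 4, 1500))  # 2
```

With TORQUE the request itself would go on the qsub command line as a resource such as -l mem=1500mb.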

Jan Ploski
