[Mauiusers] maui segfaults trying to schedule a job (SOLVED)

DuChene, StevenX A stevenx.a.duchene at intel.com
Tue Nov 29 11:53:57 MST 2011


Thanks for the suggestions.
I bumped up the following:

In mcom.h
Changed MMAX_BUFFER from 65536 to 131072

In msched-common.h
Changed MAX_MBUFFER from 65536 to 131072
Changed MMAX_BUFFER from 65536 to 131072
Changed MAX_MTASK from 4096 to 8192

In msched.h
Changed MAX_MREQ_PER_JOB from 4 to 8

Once I did that, the job request no longer caused maui to segfault.

BTW, isn't there some way that this could write an error to the logs instead of just silently segfaulting? I spent quite a bit of time looking through the logs and running maui under strace to see what might be wrong. If it logged something about these constants being too low, that would be a big help in tracking down what needs to be changed.
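
To illustrate the kind of check I mean, here is a minimal sketch in C. It is not Maui's actual code: checked_copy and its arguments are hypothetical names made up for this example, and it writes to stderr rather than using Maui's own logging. It only shows the general idea of comparing the needed size against a compile-time limit and reporting the limit by name before copying.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Maui: copy src into dst, whose capacity
 * is cap bytes. If src (plus its terminating NUL) does not fit, report which
 * limit was too small and return -1 instead of writing past the buffer.
 */
int checked_copy(char *dst, size_t cap, const char *src, const char *limit_name)
{
    size_t need = strlen(src) + 1;   /* bytes required, including the NUL */

    if (need > cap) {
        fprintf(stderr,
                "ALERT: need %zu bytes but %s is %zu; increase %s and recompile\n",
                need, limit_name, cap, limit_name);
        return -1;
    }

    memcpy(dst, src, need);          /* safe: need <= cap was checked above */
    return 0;
}
```

A caller would pass the buffer, its size, and the name of the constant that sized it, for example checked_copy(buf, sizeof buf, value, "MMAX_BUFFER"), and skip the job with an error when it returns -1.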

Also, it would be great if these limits could be changed at run time in a configuration file instead of having to edit include files and recompile.
--
Steven DuChene

-----Original Message-----
From: DuChene, StevenX A 
Sent: Tuesday, November 29, 2011 9:41 AM
To: 'Michel Béland'
Cc: mauiusers at supercluster.org
Subject: RE: [Mauiusers] maui segfaults trying to schedule a job

OK, I see that MMAX_BUFFER is set to 65536 in mcom.h and MAX_MBUFFER is set to 65536 in msched-common.h.

Our node names are 8 characters long, and this job specifically requests 172 nodes, so that would be 1376 characters of node names.
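
The 1376 figure counts only the bare names (172 x 8). As a rough sketch of how much larger the exec_host string could get, the C function below assumes the string has one "name/index" entry per core, joined by '+'. That format is my assumption here and should be checked against the real qstat -f output. The function name exec_host_len and the cores-per-node value are also just illustrative.

```c
#include <stddef.h>

/* Number of decimal digits needed to print n. */
static size_t digits(unsigned n)
{
    size_t d = 1;
    while (n >= 10) {
        n /= 10;
        d++;
    }
    return d;
}

/*
 * Estimated length (without the terminating NUL) of a string of the ASSUMED
 * form "name/0+name/1+...+name/K" repeated for every node, where every node
 * name is name_len characters long and each node contributes cores_per_node
 * entries.
 */
size_t exec_host_len(size_t nodes, size_t name_len, unsigned cores_per_node)
{
    size_t per_node = 0;

    for (unsigned c = 0; c < cores_per_node; c++)
        per_node += name_len + 1 + digits(c) + 1;   /* name '/' index '+' */

    size_t total = nodes * per_node;
    return total > 0 ? total - 1 : 0;               /* last entry has no '+' */
}
```

With 172 nodes, 8-character names, and a hypothetical 32 cores per node, exec_host_len(172, 8, 32) gives 64327, which is already close to the 65536 limit even though the bare names are only 1376 characters.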
--
Steven DuChene


-----Original Message-----
From: Michel Béland [mailto:michel.beland at calculquebec.ca] 
Sent: Tuesday, November 29, 2011 7:22 AM
To: DuChene, StevenX A
Cc: mauiusers at supercluster.org
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job

DuChene, StevenX A wrote:
>
> This morning I discovered that the maui scheduler process was not 
> running on one of our clusters like it should. When I try to start the 
> maui process as the maui user I get a segmentation fault. In checking 
> the log files the last few entries look like this:
>
>  
>
>
(...)
>
>  There is only this one job in the queue on a 256 node cluster running 
> torque 2.5.7 and maui 3.2.6p21
>
>  
>
> I have tried starting the maui process within strace but I do not see 
> any smoking gun in that strace output.
>
>  
>
> I can probably get maui to start if I qdel the job but I was sort of 
> hoping to see what was causing the problem in case any additional 
> debugging output was needed.
>
>

I guess that you have more than 16 cores per node so that your job 
requests more than 4096 cores. In that case, you have to increase 
MAX_MTASK in include/msched-common.h and recompile. It has to be equal 
to or greater than the number of cores in the cluster. You also have to 
watch out for the size of MAX_MBUFFER and MMAX_BUFFER in include/mcom.h 
and include/msched-common.h. These define the size of the buffer that 
contains the exec_host string. For large clusters, that buffer is too 
small, and large jobs will kill Maui after they have started execution. 
It is good to have short node names for that reason.
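
The MAX_MTASK rule above can be written as a small check in C. This is only a sketch: the function names are hypothetical, not part of Maui, and the cores-per-node value is whatever your hardware has.

```c
#include <stdio.h>

/* Total cores in the cluster, assuming every node has the same core count. */
unsigned long cluster_cores(unsigned long nodes, unsigned long cores_per_node)
{
    return nodes * cores_per_node;
}

/*
 * Hypothetical check: returns 1 if max_mtask is at least the number of cores
 * in the cluster; otherwise prints a hint to stderr and returns 0.
 */
int mtask_is_sufficient(unsigned long nodes, unsigned long cores_per_node,
                        unsigned long max_mtask)
{
    unsigned long need = cluster_cores(nodes, cores_per_node);

    if (max_mtask >= need)
        return 1;

    fprintf(stderr, "MAX_MTASK=%lu is below %lu cores; raise it and recompile\n",
            max_mtask, need);
    return 0;
}
```

For a 256-node cluster with, say, 24 cores per node, the cluster has 6144 cores, so a MAX_MTASK of 4096 fails the check and 8192 passes.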

Other parameters to check are MAX_MNODE, MAX_MCLASS, MAX_MREQ_PER_JOB 
(or MMAX_REQ_PER_JOB), MAX_MRES, MAX_MNODE_PER_JOB, MAX_MNODE_PER_FRAG, 
MMAX_JOB.

-- 
Michel Béland, scientific computing analyst
michel.beland at calculquebec.ca
office S-250, Roger-Gaudry building (main), Université de Montréal
telephone: 514 343-6111 ext. 3892     fax: 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)


