[Mauiusers] maui segfaults trying to schedule a job

DuChene, StevenX A stevenx.a.duchene at intel.com
Tue Nov 29 10:40:33 MST 2011


OK, I see that MMAX_BUFFER is set to 65536 in mcom.h and MAX_MBUFFER is set to 65536 in msched-common.h.

Our node names are 8 characters long, and this job requests 172 specific nodes, so the node names alone come to 1376 characters.
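
The 1376-character figure counts node names only. As a rough sketch of a fuller estimate, assume that exec_host holds one "name/slot" entry per core, joined by '+'. That format is an assumption and is not confirmed anywhere in this thread. The function names and the cores-per-node argument are hypothetical.

```c
#include <stddef.h>

/* Number of decimal digits needed to print n (0 counts as one digit). */
size_t decimal_digits(unsigned n)
{
    size_t d = 1;
    while (n >= 10) {
        n /= 10;
        d++;
    }
    return d;
}

/*
 * Estimated length of an exec_host string, ASSUMING the layout
 * "name/0+name/1+...": one entry per core, entries joined by '+'.
 * The terminating NUL is not counted.
 */
size_t exec_host_len(unsigned nodes, size_t name_len, unsigned cores_per_node)
{
    size_t per_node = 0;
    for (unsigned slot = 0; slot < cores_per_node; slot++) {
        /* name + '/' + slot digits + '+' separator */
        per_node += name_len + 1 + decimal_digits(slot) + 1;
    }
    size_t total = per_node * nodes;
    /* The last entry has no trailing '+'. */
    return total > 0 ? total - 1 : 0;
}
```

For example, exec_host_len(172, 8, 16) estimates this job's string if each node had 16 cores. Under these assumptions it comes to 31303 characters, which is still below 65536.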
--
Steven DuChene


-----Original Message-----
From: Michel Béland [mailto:michel.beland at calculquebec.ca] 
Sent: Tuesday, November 29, 2011 7:22 AM
To: DuChene, StevenX A
Cc: mauiusers at supercluster.org
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job

DuChene, StevenX A wrote:
>
> This morning I discovered that the maui scheduler process was not 
> running on one of our clusters like it should. When I try to start the 
> maui process as the maui user I get a segmentation fault. In checking 
> the log files the last few entries look like this:
>
>  
>
>
(...)
>
>  There is only this one job in the queue on a 256 node cluster running 
> torque 2.5.7 and maui 3.2.6p21
>
>  
>
> I have tried starting the maui process within strace but I do not see 
> any smoking gun in that strace output.
>
>  
>
> I can probably get maui to start if I qdel the job but I was sort of 
> hoping to see what was causing the problem in case any additional 
> debugging output was needed.
>
>

I guess that you have more than 16 cores per node, so that your job 
requests more than 4096 cores. In that case, you have to increase 
MAX_MTASK in include/msched-common.h and recompile. It has to be greater 
than or equal to the number of cores in the cluster. You also have to 
watch out for the size of MMAX_BUFFER in include/mcom.h and MAX_MBUFFER 
in include/msched-common.h. These define the size of the buffer that 
holds the exec_host string. On large clusters it is too small, and 
large jobs will kill Maui after they have started execution. It is good 
to have short node names for that reason.
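
The MAX_MTASK rule above can be written as a compile-time check. This is a minimal sketch, assuming a hypothetical 256-node cluster with 24 cores per node. CLUSTER_NODES, CORES_PER_NODE, total_cores and the value 8192 are illustrative only, and the MAX_MTASK define here is a stand-in for the one in include/msched-common.h, not a copy of it.

```c
/* Hypothetical cluster dimensions (assumptions, not taken from this thread). */
#define CLUSTER_NODES   256
#define CORES_PER_NODE  24

/* Stand-in for the limit in include/msched-common.h; 8192 is an example value. */
#define MAX_MTASK 8192

/* Fail the build if the task limit is smaller than the total core count. */
_Static_assert(MAX_MTASK >= CLUSTER_NODES * CORES_PER_NODE,
               "MAX_MTASK must be >= the number of cores in the cluster");

/* Total core count, exposed so the same rule can be checked at run time. */
int total_cores(void)
{
    return CLUSTER_NODES * CORES_PER_NODE;
}
```

Lowering the example value below 6144 makes the build fail instead of producing a scheduler that segfaults later.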

Other parameters to check are MAX_MNODE, MAX_MCLASS, MAX_MREQ_PER_JOB 
(or MMAX_REQ_PER_JOB), MAX_MRES, MAX_MNODE_PER_JOB, MAX_MNODE_PER_FRAG, 
and MMAX_JOB.

-- 
Michel Béland, scientific computing analyst
michel.beland at calculquebec.ca
office S-250, Roger-Gaudry building (main), Université de Montréal
telephone: 514 343-6111 ext. 3892     fax: 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)


