[Mauiusers] API failure with slurm

Josh England josh at tgsmc.com
Fri May 23 10:57:53 MDT 2008


I'm testing maui-3.2.6p19 with slurm-1.3.2 and found a bug in our
specific use case.

I'm using slurm's cons_res plugin with CR_Core and sched/wiki, and I want
each job to use 3 cores per task, so I'm submitting with 'sbatch -c3
job.sh'.  On an 8-core box, the first 2 jobs land on 1 node, but the 3rd
job ends up spanning 2 nodes (2 cores on one and 1 on another).  Fine.  So
I add a '-N' parameter to specify a max of 1 node: 'sbatch -c3 -N 1-1'.
This works fine with slurm alone, but maui seems not to respect that
parameter at all.  Relevant parts of the maui logs show:

...
05/23 09:20:43 INFO:     job 1121 not considered for spanning
...
05/23 09:20:43 MWikiDoCommand(ladmin1,7321,9000000,NONE,CMD=STARTJOB
ARG=1121 TASKLIST=dn37:dn37:dn36,Data,DataSize,SC)
05/23 09:20:43 INFO:     message sent: 'CMD=STARTJOB ARG=1121
TASKLIST=dn37:dn37:dn36'
05/23 09:20:43 ERROR:    command 'CMD=STARTJOB ARG=1121
TASKLIST=dn37:dn37:dn36'  SC: -914  response: 'NONE'
05/23 09:20:43 ALERT:    cannot start job '1121' on WIKI RM on 3 procs
(command failure)
05/23 09:20:43 ALERT:    cannot start job 1121 (RM 'ladmin1' failed in
function 'jobstart')
05/23 09:20:43 WARNING:  cannot start job '1121' through resource
manager
05/23 09:20:43 ALERT:    job '1121' deferred after 1 failed start
attempts (API failure on last attempt)
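
The TASKLIST in the STARTJOB command above can be checked mechanically.
The following is only a rough sketch: the helper name tasks_per_node and
the regular expression are illustrative assumptions, not part of maui or
slurm. It counts how many tasks the command places on each node, assuming
the wrapped log line has been rejoined into one string.

```python
import re
from collections import Counter


def tasks_per_node(log_text):
    """Count tasks per node in the first TASKLIST= field found in log_text.

    Hypothetical helper: assumes node names contain only letters, digits,
    '_', '.', and '-', and that entries are separated by ':'.
    """
    match = re.search(r"TASKLIST=([A-Za-z0-9_.:\-]+)", log_text)
    if match is None:
        return Counter()
    return Counter(match.group(1).split(":"))


# Command text taken from the log excerpt, rejoined onto one line.
command = "CMD=STARTJOB ARG=1121 TASKLIST=dn37:dn37:dn36"
counts = tasks_per_node(command)
print(dict(counts))
print("nodes used:", len(counts))
```

More than one distinct node in the result means the start request
conflicts with a '-N 1-1' submission.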


Maui is trying to allocate two nodes for the job even though I specified
only one, which is probably what leads to that API failure.  I seem to
remember this working right on previous versions of slurm.  Any ideas?
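
The placement described at the top (three 3-core jobs on 8-core nodes)
can be reproduced with simple arithmetic.  This is a toy first-fit model
written for illustration; it is an assumption about the packing
behaviour, not how maui or slurm actually select cores, and place_jobs
and allow_span are made-up names.

```python
def place_jobs(num_jobs, cores_per_job, cores_per_node, allow_span=True):
    """Toy first-fit packing; returns one {node_index: cores} dict per job.

    With allow_span=False each job must fit entirely on a single node,
    which mimics the intent of 'sbatch -N 1-1'.
    """
    if cores_per_job > cores_per_node and not allow_span:
        raise ValueError("job cannot fit on a single node")
    free = []        # free cores per node; nodes are added on demand
    placements = []
    for _ in range(num_jobs):
        needed = cores_per_job
        placement = {}
        if allow_span:
            i = 0
            while needed > 0:
                if i == len(free):
                    free.append(cores_per_node)
                take = min(free[i], needed)
                if take:
                    free[i] -= take
                    placement[i] = take
                    needed -= take
                i += 1
        else:
            # First node with enough free cores, else open a new node.
            idx = next((i for i, f in enumerate(free) if f >= needed), None)
            if idx is None:
                free.append(cores_per_node)
                idx = len(free) - 1
            free[idx] -= needed
            placement[idx] = needed
        placements.append(placement)
    return placements


print(place_jobs(3, 3, 8))                    # third job is split 2 + 1
print(place_jobs(3, 3, 8, allow_span=False))  # third job gets its own node
```

After two jobs the first node has 8 - 2*3 = 2 cores left, so a third
3-core job either spills onto a second node or, with the single-node
limit, has to start on a fresh one.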

-JE


