[Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Mon Nov 3 07:45:26 MST 2008


That is what I thought.  I am on the mpich mailing list also and am getting
some feedback.
Thanks to all who responded.
Mary Ellen

Greenseid, Joseph M. wrote:
> #PBS -l nodes=4:ppn=4 will request four nodes with four processors per node.  
>  
> #PBS -l nodes=4:ppn=1 will request four nodes with one processor per node.
>  
> the MPI problem is a separate issue...
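> 
> For illustration, a minimal full-node job script built around that first
> request might look like the sketch below (the program name is a
> placeholder, and it assumes MPICH2's mpd ring is already up on the
> allocated nodes):
> 
>   #!/bin/bash
>   #PBS -l nodes=4:ppn=4
>   #PBS -N mpi_full_node
>   cd $PBS_O_WORKDIR
>   # 4 nodes x 4 cores per node = 16 MPI ranks
>   mpiexec -n 16 ./my_mpi_program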
>  
> --Joe
>
> ________________________________
>
> From: mauiusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
> Sent: Fri 10/31/2008 11:45 AM
> To: mauiusers at supercluster.org; Mary Ellen Fitzpatrick
> Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
>
>
>
> Hi,
> Trying to figure out if this is a Maui or MPI issue.  I have a 48-node
> (dual dual-core CPU) Linux cluster with torque-2.3.3, maui-3.2.6p19, and
> mpich2-1.07 installed.  I am not sure whether I have Maui configured
> correctly.  What I want to do is submit an MPI job that runs one process
> per node, with each process using all 4 cores on its node, and run that
> job across 4 nodes.
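> 
> Something along these lines is what I have in mind, as a sketch (the
> program name is a placeholder, and it assumes an mpiexec that accepts
> -machinefile): request all 4 cores on each of 4 nodes, then launch only
> one rank per node from a de-duplicated copy of $PBS_NODEFILE:
> 
>   #PBS -l nodes=4:ppn=4
>   cd $PBS_O_WORKDIR
>   # keep each hostname once so only one rank lands on each node
>   sort -u $PBS_NODEFILE > nodes.unique
>   # 4 ranks total, one per node; each rank can then use its node's 4 cores
>   mpiexec -machinefile nodes.unique -n 4 ./my_mpi_program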
>
> If I request 1 node with 4 processors in my PBS script (#PBS -l
> nodes=1:ppn=4), it works fine: everything runs on one node across 4 CPUs,
> and the MPI output says everything ran perfectly.
>
> If I request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it
> fails: my epilogue/prologue output file says the job ran on 4 nodes and
> requested 16 processors.
>
> But my MPI output file says it crashed:
> --snippet--
> Initializing MPI Routines...
> Initializing MPI Routines...
> Initializing MPI Routines...
> Initializing MPI Routines...
> rank 15 in job 29  node1047_40014   caused collective abort of all ranks
>   exit status of rank 15: killed by signal 9
> rank 13 in job 29  node1047_40014   caused collective abort of all ranks
>   exit status of rank 13: killed by signal 9
> rank 12 in job 29  node1047_40014   caused collective abort of all ranks
>   exit status of rank 12: return code 0
> --snippet--
>
> Maui.cfg pertinent info:
> JOBPRIOACCRUALPOLICY    ALWAYS # accrue priority as soon as job is submitted
> JOBNODEMATCHPOLICY      EXACTNODE
> NODEALLOCATIONPOLICY    MINRESOURCE
> NODEACCESSPOLICY        SHARED
>
> /var/spool/torque/server_priv/nodes file
> node1048 np=4
> etc
>
> torque queue info:
> set queue spartans queue_type = Execution
> set queue spartans resources_default.neednodes = spartans
> set queue spartans resources_default.nodes = 1
> set queue spartans enabled = True
> set queue spartans started = True
>
> Does anyone know why my MPI job is crashing, or whether this is a
> Maui/Torque or MPI issue?
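> 
> In case it helps with the diagnosis, a quick way to compare the two views
> (a sketch; <jobid> is a placeholder for whatever qsub returns) is to dump
> what Torque actually handed the job and check it against Maui:
> 
>   # inside the job script, before the mpiexec line:
>   cat $PBS_NODEFILE        # should list each node 4 times for ppn=4
> 
>   # on the head node, after submitting:
>   checkjob <jobid>         # Maui's view of the allocated nodes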
>
> --
>
> Thanks
> Mary Ellen
>

-- 
Thanks
Mary Ellen


