[Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Mon Nov 3 07:25:19 MST 2008


#PBS -l nodes=4:ppn=4 will request four nodes with four processors per node.  
 
#PBS -l nodes=4:ppn=1 will request four nodes with one processor per node.
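
for example, a submit script for the 4x4 request might look roughly like
the sketch below.  it's only an illustration, not your exact setup:
./my_mpi_app is a placeholder, and it assumes an mpich2 mpd ring is
already running on the allocated nodes and that your mpiexec accepts
-machinefile.

#!/bin/bash
#PBS -l nodes=4:ppn=4        # 4 nodes, 4 cores on each
#PBS -N mpi_test
cd $PBS_O_WORKDIR

# Torque writes one line per allocated core to $PBS_NODEFILE,
# so nodes=4:ppn=4 gives 16 entries.
NP=$(wc -l < $PBS_NODEFILE)

# Start one MPI rank per allocated core (16 ranks here).
mpiexec -machinefile $PBS_NODEFILE -n $NP ./my_mpi_app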
 
the MPI problem is a separate issue...
 
--Joe

________________________________

From: mauiusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
Sent: Fri 10/31/2008 11:45 AM
To: mauiusers at supercluster.org; Mary Ellen Fitzpatrick
Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes



Hi,
I'm trying to figure out whether this is a Maui issue or an MPI issue.
I have a 48-node Linux cluster (each node has two dual-core CPUs, so 4
cores per node), running torque-2.3.3, maui-3.2.6p19, and mpich2-1.0.7.
I'm not sure I have Maui configured correctly.  What I want to do is
submit an MPI job that runs one process per node, with that process
using all 4 cores on the node, and have the job span 4 nodes.
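
Something like the sketch below is what I'm after (./my_app is just a
placeholder, and the mpiexec -machinefile usage is only one way I assume
this could be done):

#!/bin/bash
#PBS -l nodes=4:ppn=4          # reserve all 4 cores on each of 4 nodes
cd $PBS_O_WORKDIR

# $PBS_NODEFILE has one line per allocated core (16 lines here);
# collapse it to one line per node so only one rank starts per node.
sort -u $PBS_NODEFILE > nodefile.unique

# 4 ranks total, one per node; each rank can then use its node's 4 cores.
mpiexec -machinefile nodefile.unique -n 4 ./my_app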

If I request 1 node with 4 processors in my PBS script (#PBS -l
nodes=1:ppn=4), it works fine: everything runs on one node across 4
CPUs, and the MPI output says everything ran perfectly.

If I request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it
fails.  My prologue/epilogue output file says the job ran on 4 nodes and
requested 16 processors.

But my MPI output file says it crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 15: killed by signal 9
rank 13 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 13: killed by signal 9
rank 12 in job 29  node1047_40014   caused collective abort of all ranks
  exit status of rank 12: return code 0
--snippet--

Maui.cfg pertinent info:
JOBPRIOACCRUALPOLOCY    ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY      EXACTNODE
NODEALLOCATIONPOLICY    MINRESOURCE
NODEACCESSPOLICY        SHARED

/var/spool/torque/server_priv/nodes file
node1048 np=4
etc

torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True

Does anyone know why my MPI job is crashing, or whether this is a
Maui/Torque issue or an MPI issue?

--

Thanks
Mary Ellen


