[Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Mon Nov 3 07:25:19 MST 2008
#PBS -l nodes=4:ppn=4 will request four nodes with four processors per node.
#PBS -l nodes=4:ppn=1 will request four nodes with one processor per node.
The MPI problem is a separate issue.
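
One way to confirm what Torque actually allocated, sketched here under a few assumptions: inside a running job Torque sets PBS_NODEFILE to a file with one hostname per allocated processor, so nodes=4:ppn=4 should give 16 lines covering 4 distinct hosts. The helper name summarize_nodefile is hypothetical and not part of Torque or Maui; it only counts lines and does not launch anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): summarize a PBS node file.
# Assumes the file has one hostname per allocated processor slot.
summarize_nodefile() {
    nodefile="$1"
    # Non-empty lines = processor slots granted to the job.
    slots=$(grep -c . "$nodefile")
    # Distinct hostnames = nodes granted to the job.
    hosts=$(grep . "$nodefile" | sort -u | wc -l | tr -d ' ')
    echo "hosts=$hosts slots=$slots"
}

# Inside a job, Torque sets PBS_NODEFILE; outside a job this does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If this is placed near the top of the job script and reports 4 hosts and 16 slots for nodes=4:ppn=4, the scheduler side is probably behaving as requested, and the distinct host list is the natural starting point for launching one MPI process per node.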
--Joe
________________________________
From: mauiusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
Sent: Fri 10/31/2008 11:45 AM
To: mauiusers at supercluster.org; Mary Ellen Fitzpatrick
Subject: [Mauiusers] mpi job on multi-core nodes,fails to run on multiple nodes
Hi,
I am trying to figure out whether this is a Maui or an MPI issue. I have a
48-node Linux cluster (two dual-core CPUs per node) with torque-2.3.3,
maui-3.2.6p19, and mpich2-1.07 installed. I am not sure I have Maui
configured correctly. What I want to do is submit an MPI job that runs
one process per node, requests all 4 cores on each node, and is spread
across 4 nodes.
If I request 1 node with 4 processors in my PBS script
(#PBS -l nodes=1:ppn=4), it works fine: everything runs on one node with
4 CPUs, and the MPI output says everything ran perfectly.
If I request 4 nodes with 4 processors each (#PBS -l nodes=4:ppn=4), it
fails. My epilogue/prologue output file says the job ran on 4 nodes and
requested 16 processors.
But my MPI output file says it crashed:
--snippet--
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
Initializing MPI Routines...
rank 15 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
rank 13 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 13: killed by signal 9
rank 12 in job 29 node1047_40014 caused collective abort of all ranks
exit status of rank 12: return code 0
--snippet--
Maui.cfg pertinent info:
JOBPRIOACCRUALPOLICY ALWAYS # accrue priority as soon as job is submitted
JOBNODEMATCHPOLICY EXACTNODE
NODEALLOCATIONPOLICY MINRESOURCE
NODEACCESSPOLICY SHARED
/var/spool/torque/server_priv/nodes file:
node1048 np=4
etc
torque queue info:
set queue spartans queue_type = Execution
set queue spartans resources_default.neednodes = spartans
set queue spartans resources_default.nodes = 1
set queue spartans enabled = True
set queue spartans started = True
Does anyone know why my MPI job is crashing, or whether this is a
Maui/Torque issue or an MPI issue?
--
Thanks
Mary Ellen