[Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Mon Nov 3 08:02:31 MST 2008
do the mpds start and exit properly when you do it this way? i've always started it from within my job file -- i do something like:
#PBS -l nodes=4:ppn=4
...
mpdboot -n 4 -f $PBS_NODEFILE
mpiexec ...
mpdallexit
it's been a while since i've used an MPI with mpds, but i thought it just needed one mpd per host (not one per processor), right? that's why i start 4 here...
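A rough sketch of the counting behind "one mpd per host", in case it helps. It assumes the node file has one hostname per line, repeated once per allocated core; the sample file and the variable names NODEFILE and NHOSTS are made up for illustration, and the commands are only printed, not run:

```shell
#!/bin/sh
# Sample node file standing in for $PBS_NODEFILE. Assumption: one hostname
# per line, repeated once per allocated core (nodes=4:ppn=4 gives 16 lines).
NODEFILE=$(mktemp)
for h in node1045 node1046 node1047 node1048; do
    for i in 1 2 3 4; do
        echo "$h"
    done
done > "$NODEFILE"

# Total MPI processes = lines in the file; mpds needed = distinct hosts.
NP=$(( $(wc -l < "$NODEFILE") ))
NHOSTS=$(( $(sort -u "$NODEFILE" | wc -l) ))

# Dry run: print the commands instead of executing them.
echo "mpdboot -n $NHOSTS -f $NODEFILE"
echo "mpiexec -n $NP ..."
echo "mpdallexit"

rm -f "$NODEFILE"
```

Inside a real job you would point NODEFILE at $PBS_NODEFILE and drop the echo wrappers; whether your node file really has this layout is worth checking first.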
--Joe
________________________________
From: mauiusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
Sent: Mon 11/3/2008 9:43 AM
To: Joseph Hargitai; mauiusers at supercluster.org; Mary Ellen Fitzpatrick
Subject: Re: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
My pbs script
-snippet
# Request 4 processor/node
#PBS -l nodes=4:ppn=4
# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP
mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log
My output file lists "Number of processors is 16", which is what I requested.
I start all of the mpd on all of the nodes from the head node with the
following command:
mpdboot -n 47 -f /etc/mpd.hosts
Should I be starting the mpd daemon from within my pbs script?
/etc/mpd.hosts is on every compute node and lists the following:
node1045:4
node1046:4
node1047:4
node1048:4
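For comparison, a host:count list like the one above could be derived from a per-core node file roughly as follows. This is only a sketch under assumptions: the sample file stands in for a node file with one hostname per allocated core, and the names NODEFILE and MPD_HOSTS are hypothetical.

```shell
#!/bin/sh
# Sample per-core node file (assumed layout: one hostname per line,
# repeated once for each core allocated on that host).
NODEFILE=$(mktemp)
for h in node1045 node1046 node1047 node1048; do
    for i in 1 2 3 4; do
        echo "$h"
    done
done > "$NODEFILE"

# Count how often each host appears and emit host:count lines.
MPD_HOSTS=$(sort "$NODEFILE" | uniq -c | awk '{print $2 ":" $1}')
echo "$MPD_HOSTS"

rm -f "$NODEFILE"
```

Comparing this output against /etc/mpd.hosts is one quick way to see whether the hosts and core counts the job was given match what the mpd ring was started with.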
My $PBS_NODEFILE has the following:
node1045 np=4 lomem spartans
node1046 np=4 lomem spartans
node1047 np=4 lomem spartans
node1048 np=4 lomem spartans
Thanks
Mary Ellen
Joseph Hargitai wrote:
> What is in the pbs script? In most cases you need a -hostfile $PBS_NODEFILE entry; otherwise all processes pile up on one node, i.e. the job does not know of any hosts other than the one it landed on.
>
>
> j
>
> ----- Original Message -----
> From: Mary Ellen Fitzpatrick <mfitzpat at bu.edu>
> Date: Friday, October 31, 2008 11:45 am
> Subject: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
>
>
>> Hi,
>> Trying to figure out if this is a maui or mpi issue. I have a 48-node
>> linux cluster (dual dual-core cpus per node). I have torque-2.3.3,
>> maui-3.2.6p19, mpich2-1.07 installed. Not sure if I have maui
>> configured correctly. What I want to do is submit an mpi job that
>> runs one process per node, requests all 4 cores on each node, and
>> runs across 4 nodes.
>>
>> If I request in my pbs script 1 node with 4 processors, then it works
>> fine: #PBS -l nodes=1:ppn=4, everything runs on one node 4 cpus, mpi
>> output says everything ran perfect.
>>
>> If I request in my pbs script 4 nodes with 4 processors then it fails:
>> #PBS -l nodes=4:ppn=4, my epilogue/prologue output file says the job
>> ran on 4 nodes and requests 16 processors.
>>
>> But my mpi output file says it crashed:
>> --snippet--
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> Initializing MPI Routines...
>> rank 15 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 15: killed by signal 9
>> rank 13 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 13: killed by signal 9
>> rank 12 in job 29 node1047_40014 caused collective abort of all ranks
>> exit status of rank 12: return code 0
>> --snippet--
>>
>> Maui.cfg pertinent info:
>> JOBPRIOACCRUALPOLOCY ALWAYS # accrue priority as soon as job is submitted
>> JOBNODEMATCHPOLICY EXACTNODE
>> NODEALLOCATIONPOLICY MINRESOURCE
>> NODEACCESSPOLICY SHARED
>>
>> /var/spool/torque/server_priv/nodes file
>> node1048 np=4
>> etc
>>
>> torque queue info:
>> set queue spartans queue_type = Execution
>> set queue spartans resources_default.neednodes = spartans
>> set queue spartans resources_default.nodes = 1
>> set queue spartans enabled = True
>> set queue spartans started = True
>>
>> Anyone know why my mpi job is crashing? Or if this is a maui/torque
>> or mpi issue?
>>
>> --
>>
>> Thanks
>> Mary Ellen
>>
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>
>
--
Thanks
Mary Ellen