[Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes: RESOLVED

Mary Ellen Fitzpatrick mfitzpat at bu.edu
Mon Nov 3 12:49:25 MST 2008


Yes, the mpds start and exit without issue when I start them from my head 
node.

I was able to resolve the issue by adding a machinefile option to my 
mpiexec command.

From within my PBS script I was running the following, and it gave the 
rank abort error:

mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log

I added "-machinefile $PBS_NODEFILE" right after the mpiexec call, and 
it worked:

mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log
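
If it helps anyone debug a similar problem, a quick way to see what the 
job actually gets is to dump $PBS_NODEFILE from inside the script (these 
lines are only a debugging aid, not part of my original script):

echo "PBS_NODEFILE is $PBS_NODEFILE"
cat $PBS_NODEFILE                    # the hosts/slots Torque handed this job
echo "Lines in nodefile: $(wc -l < $PBS_NODEFILE)"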

My error (well, one of them anyway :) ) was assuming that because I had 
the /etc/mpd.hosts file on each node with the node:ppn list, it was 
being read.  Apparently not.  The job takes its host list from 
$PBS_NODEFILE instead.
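
For anyone who wants it all in one place, here is a minimal sketch of 
the complete script as it works for me now.  The shebang and the cd to 
$PBS_O_WORKDIR are not in my snippets above, just sensible defaults, and 
the mpd ring is still booted separately from the head node as described 
in my earlier message below.

#!/bin/bash
# Request 4 nodes with 4 processors each
#PBS -l nodes=4:ppn=4

# Run from the directory the job was submitted from
# (an addition for completeness, not in my original snippet)
cd $PBS_O_WORKDIR

# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP

# Point mpiexec at the Torque node list explicitly -- this was the fix
mpiexec -machinefile $PBS_NODEFILE -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log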

Thanks to all who responded and I hope this info is helpful to others.
Mary Ellen



Greenseid, Joseph M. wrote:
> Do the mpds start and exit properly when you do it this way?  I've always started them from within my job file -- I do something like:
>  
> #PBS -l nodes=4:ppn=4
> ...
> mpdboot -n 4 -f $PBS_NODEFILE
> mpiexec ...
> mpdallexit
>  
> It's been a while since I've used an MPI with mpds, but I thought it just needed one mpd per host (not one per processor), right?  That's why I start 4 here...
>  
> --Joe
>
> ________________________________
>
> From: mauiusers-bounces at supercluster.org on behalf of Mary Ellen Fitzpatrick
> Sent: Mon 11/3/2008 9:43 AM
> To: Joseph Hargitai; mauiusers at supercluster.org; Mary Ellen Fitzpatrick
> Subject: Re: [Mauiusers] mpi job on multi-core nodes, fails to run on multiple nodes
>
>
>
> My pbs script
> -snippet
> # Request 4 processor/node
> #PBS -l nodes=4:ppn=4
>
> # How many procs do I have?
> NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
> echo Number of processors is $NP
>
> mpiexec -n $NP dock6.mpi -i dock.in -o dock.out &> dock.log
>
> My output file lists "Number of processors is 16", which is what I requested.
>
>
> I start all of the mpds on all of the nodes from the head node with the
> following command:
> mpdboot -n 47 -f /etc/mpd.hosts
>
> Should I be starting the mpd daemon from within my pbs script?
>
> /etc/mpd.hosts is on every compute node and lists the following:
> node1045:4
> node1046:4
> node1047:4
> node1048:4
>
> My $PBS_NODEFILE has the following:
> node1045 np=4 lomem spartans
> node1046 np=4 lomem spartans
> node1047 np=4 lomem spartans
> node1048 np=4 lomem spartans
>
> Thanks
> Mary Ellen
>
> Joseph Hargitai wrote:
>   
>> What is in the pbs script? In most cases you need a -hostfile $PBS_NODEFILE entry; otherwise you get all processes piled on one node, i.e. the job does not know of any hosts other than the one it landed on.
>>
>>
>> j
>>
>> ----- Original Message -----
>> From: Mary Ellen Fitzpatrick <mfitzpat at bu.edu>
>> Date: Friday, October 31, 2008 11:45 am
>> Subject: [Mauiusers] mpi job on multi-core nodes,     fails to run on multiple nodes
>>
>>  
>>     
>>> Hi,
>>> Trying to figure out whether this is a Maui or MPI issue.  I have a
>>> 48-node (dual dual-core CPU) Linux cluster with torque-2.3.3,
>>> maui-3.2.6p19, and mpich2-1.07 installed.  Not sure if I have maui
>>> configured correctly.  What I want to do is submit an MPI job that
>>> runs one process per node, requests all 4 cores on each node, and
>>> spans 4 nodes.
>>>
>>> If I request 1 node with 4 processors in my pbs script, it works fine:
>>> #PBS -l nodes=1:ppn=4.  Everything runs on one node with 4 cpus, and
>>> the mpi output says everything ran perfectly.
>>>
>>> If I request 4 nodes with 4 processors each, it fails:
>>> #PBS -l nodes=4:ppn=4.  My epilogue/prologue output file says the job
>>> ran on 4 nodes and requested 16 processors.
>>>
>>> But my mpi output file says it crashed:
>>> --snippet--
>>> Initializing MPI Routines...
>>> Initializing MPI Routines...
>>> Initializing MPI Routines...
>>> Initializing MPI Routines...
>>> rank 15 in job 29  node1047_40014   caused collective abort of all ranks
>>>   exit status of rank 15: killed by signal 9
>>> rank 13 in job 29  node1047_40014   caused collective abort of all ranks
>>>   exit status of rank 13: killed by signal 9
>>> rank 12 in job 29  node1047_40014   caused collective abort of all ranks
>>>   exit status of rank 12: return code 0
>>> --snippet--
>>>
>>> Maui.cfg pertinent info:
>>> JOBPRIOACCRUALPOLICY    ALWAYS # accrue priority as soon as job is submitted
>>> JOBNODEMATCHPOLICY      EXACTNODE
>>> NODEALLOCATIONPOLICY    MINRESOURCE
>>> NODEACCESSPOLICY        SHARED
>>>
>>> /var/spool/torque/server_priv/nodes file
>>> node1048 np=4
>>> etc
>>>
>>> torque queue info:
>>> set queue spartans queue_type = Execution
>>> set queue spartans resources_default.neednodes = spartans
>>> set queue spartans resources_default.nodes = 1
>>> set queue spartans enabled = True
>>> set queue spartans started = True
>>>
>>> Anyone know why my mpi job is crashing?  Or whether this is a
>>> maui/torque or mpi issue?
>>>
>>> --
>>>
>>> Thanks
>>> Mary Ellen
>>>
>
> --
> Thanks
> Mary Ellen
>

-- 
Thanks
Mary Ellen


