[torqueusers] Issues running 256 CPU pbs job

Donald Tripp dtripp at hawaii.edu
Wed Jan 24 14:17:55 MST 2007


mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4 (Illegal instruction).

That leads me to believe the binaries are not compatible...
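
A quick way to check (just a sketch; the binary path is the one from
your submit script, and the compile flags below are only a guess at
what might have been used on the 720) is to compare what the binary
was built for with what each node reports:

    # run on one marvin node and one otis node
    file /home/jbennett/CRAFT/craft_mb1006.exe   # ELF target / word size
    grep -m1 -i '^cpu' /proc/cpuinfo             # POWER5 vs PPC970 (G5)

If the code was built with something like -mcpu=power5 or -qarch=pwr5,
the 970s in the Xserves could hit instructions they do not implement,
which would match the illegal-instruction signals.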


- Donald Tripp
  dtripp at hawaii.edu
----------------------------------------------
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo,   Hawaii   96720
http://www.hpc.uhh.hawaii.edu


On Jan 24, 2007, at 11:14 AM, Brad Mecklenburg wrote:

> Yes, your assumption is correct.  The job was compiled on the IBM
> Open Power 720 and was not recompiled on the Xserves.  As a first
> test, we wanted to see whether the same compiled binary could be used
> on both clusters.  That may not be the case, but I wanted to see if
> any of you had ideas based on the errors given.  Thanks.
>
>
> On 1/24/07 2:50 PM, "Donald Tripp" <dtripp at hawaii.edu> wrote:
>
>> I'm assuming you're using PPC Xserves? I'm not sure whether the PPC
>> in an Xserve and the PPC in the IBM servers are similar enough to
>> run the same jobs. On what machine type was the job compiled?
>>
>>
>> - Donald Tripp
>>  dtripp at hawaii.edu
>> ----------------------------------------------
>> HPC Systems Administrator
>> High Performance Computing Center
>> University of Hawai'i at Hilo
>> 200 W. Kawili Street
>> Hilo,   Hawaii   96720
>> http://www.hpc.uhh.hawaii.edu
>>
>>
>>
>> On Jan 24, 2007, at 10:36 AM, Brad Mecklenburg wrote:
>>
>>> I have some questions about what I am doing wrong in the setup or
>>> implementation of some pbs jobs.  I am trying to combine two
>>> clusters we have.  One is a 128-node IBM Open Power 5 cluster
>>> (marvin) running SLES 9 and the other is a 128-node Apple Xserve
>>> cluster (otis).  The IBM cluster has pretty much remained intact,
>>> and we added the Apple cluster to it by putting OpenSuse 10.2 on
>>> those nodes.
>>>
>>> Torque-2.1.2
>>> Maui-3.2.6p16
>>> Mx-1.2.1
>>> Mpich-mx 1.2.6..0.94
>>> Mpiexec.81
>>>
>>> We have addressed many issues, but something is still wrong.  The
>>> head node of the IBM cluster is serving out everything.  I am
>>> currently trying to run a 128-node (256 proc) pbs job on the Apple
>>> nodes.  I have tried both mpirun and mpiexec in the pbs submit
>>> script, but both give errors, and I will show both below.  The same
>>> binary is being used for the IBM nodes and the Apple nodes.  I am
>>> able to run a test job of 64 nodes with ppn=2 but was not able to
>>> with 100 nodes; the information given below is for a 128-node ppn=2
>>> case.
>>>
>>> When I try to submit using either mpirun or mpiexec, the maui log  
>>> gives this
>>> error:
>>> 01/24 11:21:58 INFO:     job '1661' Priority:        1
>>> 01/24 11:21:58 INFO:     job '1661' Priority:        1
>>> 01/24 11:21:58 MResDestroy(1661)
>>> 01/24 11:21:58 MResChargeAllocation(1661,2)
>>> 01/24 11:21:58 INFO:     256 feasible tasks found for job 1661:0 in partition DEFAULT (256 Needed)
>>> 01/24 11:21:58 ALERT:    inadequate tasks to allocate to job 1661:0 (176 < 256)
>>> 01/24 11:21:58 ERROR:    cannot allocate nodes to job '1661' in partition DEFAULT
>>>
>>> The part of the error where it states 176 < 256 changes throughout
>>> the log while the job is queued.  I have seen 2 < 256, 188 < 256,
>>> 192 < 256, and maybe more.  This is probably the problem, but I am
>>> not sure why it reports inadequate tasks when the line above it in
>>> the maui log says 256 feasible tasks were found and 256 are needed.
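>>>
>>> As a rough sketch of where this could be checked, with standard
>>> torque and Maui commands and the job id from the log above:
>>>
>>>     pbsnodes -l      # nodes pbs_server currently marks down/offline
>>>     diagnose -n      # Maui's per-node processor availability
>>>     checkjob 1661    # Maui's reason for not starting this queued job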
>>>
>>> When using mpirun, the job sits in the queue, but if I do a qrun on
>>> the job id, the job will run, though not as expected.  In the pbs
>>> submit script I specify that the job should run on the Apple nodes.
>>> But when I do a qrun, an Apple node is designated the mother
>>> superior node while the job runs on the IBM nodes.  I am not sure
>>> why this is the case.  Here is my pbs submit script, and you can
>>> see I specify the Apple nodes with otis.  All of the nodes have the
>>> same attributes in /var/spool/torque/server_priv/nodes except that
>>> the Apple nodes have otis and the IBM nodes have marvin.
>>>
>>> #!/bin/sh
>>> #PBS -N inter41
>>> #PBS -l nodes=128:ppn=2:otis
>>> #PBS -l walltime=23:59:00
>>> #PBS -j oe
>>> #PBS -r n
>>> cd /home/jbennett/test
>>>
>>> CODE_PATH=/home/jbennett/CRAFT
>>>
>>> NPROCS=`wc -l < $PBS_NODEFILE`
>>> date
>>>
>>> time /opt/mpiexec/bin/mpiexec -comm mx -n $NPROCS $CODE_PATH/craft_mb1006.exe -mpi
>>> #time mpirun.ch_mx -s --mx-kill 5 -np $NPROCS $CODE_PATH/craft_mb1006.exe -mpi
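>>>
>>> For reference, the entries in the server_priv/nodes file look
>>> roughly like this (the hostnames here are only placeholders; np=2
>>> matches the ppn=2 being requested, and the last field is the
>>> cluster property):
>>>
>>>     # /var/spool/torque/server_priv/nodes
>>>     some-otis-node    np=2 otis
>>>     some-marvin-node  np=2 marvin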
>>>
>>> I have switched back and forth between using mpirun and mpiexec.
>>>
>>> When using mpiexec, the job also sits in the queue, and when I try
>>> to qrun the job I get the following errors:
>>> number of processors =   256 186  r08n38
>>> number of processors =   256 151  r09n13
>>> number of processors =   256 118  r09n29
>>> MX:r08n26:Got a NACK:req status 8:Remote endpoint is closed
>>>         type (8): connect
>>>         state (0x0):
>>>         requeued: 1 (timeout=510000ms)
>>>         dest: 00:60:dd:48:1a:b4 (r10n15:0)
>>>         partner: peer_index=22, endpoint=1, seqnum=0x0
>>>         connect_seq: 0x1
>>>
>>> This continues for many more of the compute nodes until it ends
>>> with this error:
>>> MX:Aborting
>>> mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4 (Illegal instruction).
>>> mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.
>>> mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).
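>>>
>>> The "Remote endpoint is closed" NACKs are probably just fallout from
>>> the tasks that died with signal 4, but in case the MX fabric itself
>>> is suspect, a quick check on a couple of the nodes named above (the
>>> path assumes a default /opt/mx install) would be:
>>>
>>>     /opt/mx/bin/mx_info    # NIC status and the mapper's peer table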
>>>
>>>
>>>
>>> Any ideas on what I may be doing wrong or may have forgotten to
>>> change, or any other helpful information, would be appreciated.
>>> Thanks.
>>>
>>> -- 
>>> Brad Mecklenburg
>>> COLSA HMT-ROC
>>> Office: 256-721-0372 x 108
>>> Fax:  256-721-2466
>>>
>>>
>>>
>>
>>
>
>
> -- 
> Brad Mecklenburg
> COLSA HMT-ROC
> Office: 256-721-0372 x 108
> Fax:  256-721-2466
>
