[torqueusers] Issues running 256 CPU pbs job
Donald Tripp
dtripp at hawaii.edu
Wed Jan 24 14:17:55 MST 2007
mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4 (Illegal instruction).
That leads me to believe that the binaries are not compatible between the two node types...
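A minimal sketch of how one might check this, assuming the standard `kill`, `uname`, and `file` utilities are installed on the nodes. The helper names `signal_name` and `describe_binary` are made up for illustration and are not part of Torque or mpiexec. The first confirms what the signal number in the mpiexec warning means; the second prints the node's machine type next to what `file` reports for the executable.

```shell
#!/bin/sh
# Hypothetical helper: translate a signal number reported by mpiexec
# into its symbolic name using the POSIX `kill -l` form.
signal_name() {
    kill -l "$1"
}

# Hypothetical helper: show this node's hostname and machine type,
# followed by the file type of the given binary (following symlinks).
describe_binary() {
    printf '%s: %s\n' "$(uname -n)" "$(uname -m)"
    file -L "$1"
}

# Signal 4 from the mpiexec warning is SIGILL (illegal instruction).
signal_name 4
```

Running `describe_binary "$CODE_PATH/craft_mb1006.exe"` on one marvin node and one otis node would show whether the two at least agree on word size and machine family. Note that `file` only reads the executable's header, so a matching report does not rule out CPU-specific instructions emitted by compiler tuning flags; recompiling on an Xserve node is the more direct test.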
- Donald Tripp
dtripp at hawaii.edu
----------------------------------------------
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo, Hawaii 96720
http://www.hpc.uhh.hawaii.edu
On Jan 24, 2007, at 11:14 AM, Brad Mecklenburg wrote:
> Yes, your assumption is correct. The job was compiled on the IBM
> Open Power 720. It was not recompiled on the Xserves. As a first
> test, I wanted to see if the same compiled binary could be used on
> both clusters. This may not be the case, but I wanted to see if any
> of you had any ideas based on the errors given. Thanks.
>
>
> On 1/24/07 2:50 PM, "Donald Tripp" <dtripp at hawaii.edu> wrote:
>
>> I'm assuming you're using PPC Xserves? I'm not sure whether the PPC
>> in an Xserve and in the IBM servers are similar enough to work
>> together to run jobs. On what machine type was the job compiled?
>>
>> On Jan 24, 2007, at 10:36 AM, Brad Mecklenburg wrote:
>>
>>> I have some questions on what I am doing wrong in the setup or
>>> implementation of running some pbs jobs. I am trying to combine two
>>> clusters we have. One is a 128-node IBM Open Power 5 cluster (marvin)
>>> running SLES 9 and the other is a 128-node Apple Xserve cluster 9 (otis).
>>> The IBM cluster has pretty much remained intact, and we added the Apple
>>> cluster to it by putting OpenSuse 10.2 on them.
>>>
>>> Torque-2.1.2
>>> Maui-3.2.6p16
>>> Mx-1.2.1
>>> Mpich-mx 1.2.6..0.94
>>> Mpiexec.81
>>>
>>> We have addressed many issues, but something is still wrong. The head node
>>> of the IBM cluster is serving out everything. I am currently trying to run
>>> a 128-node (256-proc) pbs job on the Apple nodes. I have tried both mpirun
>>> and mpiexec in the pbs submit script, but both give errors, and I will show
>>> both of these. The same binary is being used for the IBM nodes and Apple
>>> nodes. I am able to run a test job of 64 nodes with ppn=2 but was not able
>>> to with 100 nodes; the information given here is for running a 128-node,
>>> ppn=2 case.
>>>
>>> When I try to submit using either mpirun or mpiexec, the maui log
>>> gives this
>>> error:
>>> 01/24 11:21:58 INFO: job '1661' Priority: 1
>>> 01/24 11:21:58 INFO: job '1661' Priority: 1
>>> 01/24 11:21:58 MResDestroy(1661)
>>> 01/24 11:21:58 MResChargeAllocation(1661,2)
>>> 01/24 11:21:58 INFO: 256 feasible tasks found for job 1661:0 in partition DEFAULT (256 Needed)
>>> 01/24 11:21:58 ALERT: inadequate tasks to allocate to job 1661:0 (176 < 256)
>>> 01/24 11:21:58 ERROR: cannot allocate nodes to job '1661' in partition DEFAULT
>>>
>>> The part of the error where it states 176 < 256 changes
>>> throughout the log
>>> while the job is queued. I have seen 2 < 256, 188 < 256, 192 <
>>> 256 and maybe
>>> more. This probably is the problem but I am not sure why it says
>>> there are
>>> inadequate tasks when the line above it in the maui log says 256
>>> feasible
>>> tasks and 256 needed.
>>>
>>> When using mpirun, the job sits in the queue, but if I do a qrun on the job
>>> id, the job will run, though not as expected. In the pbs submit script I
>>> specify the Apple nodes to be run on. But when I do a qrun, an Apple node
>>> is designated the mother superior node while the job runs on the IBM nodes.
>>> I am not sure why this is the case. Here is my pbs submit script, and you
>>> can see I specify the Apple nodes with otis. All of the nodes have the same
>>> attributes in /var/spool/torque/server_priv/nodes except that the Apple
>>> nodes have otis and the IBM nodes have marvin.
>>>
>>> #!/bin/sh
>>> #PBS -N inter41
>>> #PBS -l nodes=128:ppn=2:otis
>>> #PBS -l walltime=23:59:00
>>> #PBS -j oe
>>> #PBS -r n
>>> cd /home/jbennett/test
>>>
>>> CODE_PATH=/home/jbennett/CRAFT
>>>
>>> NPROCS=`wc -l < $PBS_NODEFILE`
>>> date
>>>
>>> time /opt/mpiexec/bin/mpiexec -comm mx -n $NPROCS $CODE_PATH/craft_mb1006.exe -mpi
>>> #time mpirun.ch_mx -s --mx-kill 5 -np $NPROCS $CODE_PATH/craft_mb1006.exe -mpi
>>>
>>> I have changed back and forth using mpirun and mpiexec.
>>>
>>> When using mpiexec, the job sits in the queue and when I try to
>>> qrun the job
>>> I get the following errors
>>> number of processors = 256 186 r08n38
>>> number of processors = 256 151 r09n13
>>> number of processors = 256 118 r09n29
>>> MX:r08n26:Got a NACK:req status 8:Remote endpoint is closed
>>> type (8): connect
>>> state (0x0):
>>> requeued: 1 (timeout=510000ms)
>>> dest: 00:60:dd:48:1a:b4 (r10n15:0)
>>> partner: peer_index=22, endpoint=1, seqnum=0x0
>>> connect_seq: 0x1
>>>
>>> This continues on for many more of the compute nodes until it
>>> comes down to
>>> this error:
>>> MX:Aborting
>>> mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4 (Illegal instruction).
>>> mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.
>>> mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).
>>>
>>>
>>>
>>> Any ideas on what I may be doing wrong or forgot to change, or
>>> any helpful
>>> information would be appreciated. Thanks.
>>>
>>> --
>>> Brad Mecklenburg
>>> COLSA HMT-ROC
>>> Office: 256-721-0372 x 108
>>> Fax: 256-721-2466
>>>
>>
>>
>
>
> --
> Brad Mecklenburg
> COLSA HMT-ROC
> Office: 256-721-0372 x 108
> Fax: 256-721-2466
>