[torqueusers] Torque with OpenMPI

Jozef Káčer quickparser at gmail.com
Thu Feb 21 10:09:42 MST 2008


This was great, Craig! I would never have guessed that the code might be buggy.
I copied the recompiled binary to all my nodes (I still don't have NFS).
Now, when I run the code directly with mpirun, I get this:

q-parser@f135-3:~$ mpirun -np 7 --hostfile zoznam test_app
0(f135-4): We have 7 processors
0(f135-4): Hello 1! Processor 1 (f135-5) reporting for duty

0(f135-4): Hello 2! Processor 2 (f135-6) reporting for duty

0(f135-4): Hello 3! Processor 3 (f135-7) reporting for duty

0(f135-4): Hello 4! Processor 4 (f135-8) reporting for duty

0(f135-4): Hello 5! Processor 5 (f135-9) reporting for duty

0(f135-4): Hello 6! Processor 6 (f135-11) reporting for duty

It seems to me that one processor is still lost, but I get no error message
about it.
However, when I run it through Torque, the job appears to hang: 'showq' shows
the job as running, but it never finishes.

q-parser@f135-3:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING  STARTTIME

113                q-parser    Running     7    00:49:29  Thu Feb 21 17:56:04

     1 Active Job        7 of   22 Processors Active (31.82%)
                         4 of   11 Nodes Active      (36.36%)
...

My script looks like this:

#!/bin/bash

#PBS -N test_job
#PBS -q batch
#PBS -l nodes=7
#PBS -l cput=00:02:00

cd
mpirun ./test_app
exit 0

All my nodes are running now. qstat -f tells me that the job was assigned to
these hosts:

    exec_host = f135-15/1+f135-15/0+f135-14/1+f135-14/0+f135-13/1+f135-13/0+f135-12/0

I'm thankful for your time and effort.


On Thu, Feb 21, 2008 at 5:37 PM, Craig West <cwest at astro.umass.edu> wrote:

>
> Jozef,
>
> It is buggy code. The simple problem is that idstr is only 32 chars.
> When you sprintf the long string at line 45 of the code, you write
> past the end of the idstr buffer; segfaults and the like will occur. Change
> the size of idstr to 64 and try again. Don't go much bigger than
> 64, as you will cause problems with BUFSIZE.
>
> I should note that it crashed here when I ran it, and works fine with
> idstr[64].
>
> > If anybody might know of anything that could help me I'm listening.
>
> Craig.
>